I.2.Linear Algebra

If you need just quick ref - see The Matrix CookBook. For full book about linear algebra - Shilov 1977 linear algebra

is just a single number.
is an array of numbers.
is a 2-D array of numbers, so each element is identified by two indices instead of just one.
an array with more than two axes.


The product operation of matrices(matrix product) \(C = AB\) is defined as \(C_{i,j} = \sum_{k} A_{i,k} B_{j,k}\).

Matrix containing the product of the individual elements called the element-wise product or Hadamard product, and is denoted as \(A \odot B\).

Matrix multiplication is distributive \(A(B + C) = AB + AC\).

Matrix multiplication is associative also \(A(BC) = (AB)C\).

Transpose matrix product \((AB)^{T} = B^{T} A^{T}\).

An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves n-dimensional vectors as \(I_{n}\).

\begin{equation*} \forall x \in \mathbb{R}, I_{n}x = x \end{equation*}

All the entries along the main diagonal in identity matrix is 1, while all other entries are zero.

The matrix inverse of \(A\) is denoted as \(A^{-1}\) , and it is defined as the matrix such that \(A A^{-1} = I_{n}\).


the point specified by the vector of all zeros.
Linear combination
of some set of vectors \({ v^{1} , … , v^{n} }\) is given by multiplying each vector \(v^{i}\) by a corresponding scalar coefficient and adding the results: \(\sum_{i}c_{i}v^{i}\).
of a set of vectors is the set of all points obtainable by linear combination of the original vectors.

A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors.

Only square matrix with linearly independent columns have determinant.

A square matrix with linearly dependent columns is known as singular.

For square matrices the left inverse and right inverse are equal \(AA^{-1}=I\).


are functions mapping vectors to non-negative values. Can be threated as size of the vector.

\(L^{p}\) norm is given by \(||x||_{p}=(\sum_{i}|x_{i}|^{p})^{1/p}\), for \(p \in \mathbb{R}, p \geq 1\).

Euclidean norm
the \(L^{2}\) norm, with \(p = 2\).

Squared \(L^{2}\) norm can be calculated simply as \(x^{T}x\).

\(L^{2}\) norm may be undesirable because it increases very slowly near the origin. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the \(L^{1}\) norm. The \(L^{1}\) norm is commonly used in machine learning when the difference between zero and nonzero elements is very important.

\(L^{1}\) norm

The \(L^{\infty}\) norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector.

\(L^{\infty}\) norm
\(||x||_{\infty} = max_{i}|xi|\)

Most common way to measure the size of a matrix this is Frobenius norm which is analogous to the \(L^{2}\) norm of a vector.

Frobenius norm

The dot product of two vectors can be rewritten in terms of norms.

Dot product

where \(\theta\) is the angle between \(x\) and \(y\).

Special Kinds of Matrices and Vectors

Diagonal matrices consist mostly of zeros and have non-zero entries only alongthe main diagonal. We write \(diag(v)\) to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector \(v\). To compute \(diag(v)x\), we only need to scale each element \(x_i\) by \(v_i\). In other words, \(diag(v)x = x \odot y\). The inverse exists only if every diagonal entry is nonzero, and in that case, \(diag(v)^{-1} = diag([1/v_1, …, 1/v_n]^T)\).

A symmetric matrix is any matrix that is equal to its own transpose: \(A = A^T\).

A unit vector is a vector with unit norm: \(||x||_2 = 1\).

A vector \(x\) and a vector \(y\) are orthogonal to each other if \(x^Ty = 0\). In \(\mathbb{R}^{n}\), at most \(n\) vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.

An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal: \(A^TA = AA^T = I\). This implies that \(A^{-1} = A^T\).


Edeigen-decomposition decompose a matrix into a set of eigenvectors and eigenvalues. An eigenvector of a square matrix \(\pmb A\) is a non-zero vector \(\pmb v\) such that multiplication by \(\pmb A\) alters only the scale of \(\pmb v\):

\begin{equation*} \pmb{Av} = \lambda \pmb{v}. \end{equation*}

The scalar \(\lambda\) is known as the eigenvalue corresponding to this eigenvector. If \(\pmb v\) is an eigenvector of \(\pmb A\), then so is any rescaled vector \(\pmb{sv}\) for \(\pmb{s} \in \mathbb{R}, s \neq 0\). Moreover, \(\pmb{sv}\) still has the same eigenvalue. For this reason, we usually only look for unit eigenvectors.

Suppose that a matrix \(\pmb A\) has \(n\) linearly independent eigenvectors, \({v^{(1)}, … ,v^{(n)}}\), with corresponding eigenvalues \({\lambda_1, … , \lambda_n}\). We may concatenate all of the eigenvectors to form a matrix \(\pmb V\) with one eigenvector per column: \(\pmb V = [v^{(1)}, … ,v^{(n)}]\). Likewise, we can concatenate the eigenvalues to form a vector \(\pmb{\lambda}= [\lambda_1, … ,\lambda_n]^T\). The eigendecomposition of \(\pmb A\) is then given by:

\begin{equation*} \pmb A = \pmb V diag(\pmb{\lambda}) \pmb V^{-1} \end{equation*}

Not every matrix can be decomposed into eigenvalues and eigenvectors. Every real symmetricmatrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:

\begin{equation*} \pmb A = \pmb{Q \wedge Q}^T \end{equation*}

where \(\pmb Q\) is an orthogonal matrix composed of eigenvectors of \(\pmb A\), and \(\pmb{\wedge}\) is a diagonal matrix. The eigenvalue \(\wedge_{i,i}\) is associated with the eigenvector in columni of \(\pmb{Q}\), denoted as \(\pmb{Q}_{:,i}\). Because \(\pmb{Q}\) is an orthogonal matrix, we can think of \(\pmb{A}\) as scaling space by \(\lambda_i\) in direction \(\pmb{v}^{(i)}\).

While any real symmetric matrix \(\pmb{A}\) is guaranteed to have an eigendecomposition, the eigendecomposition may not be unique. If any two or more eigenvectors share the same eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors with that eigenvalue, and we could equivalently choose a \(\pmb{Q}\) using those eigenvectors instead. By convention, we usually sort the entries of \(\pmb{\wedge}\) in descending order. Under this convention, the eigendecomposition is unique only if all of the eigenvalues are unique.

The matrix is singular if and only if any of the eigenvalues are zero. The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form \(f(\pmb{x}) = \pmb{x}^T \pmb{Ax}\) subject to \(||\pmb{x}||_2 = 1\). Whenever \(\pmb x\) is equal to an eigenvector of \(\pmb A\), \(f\) takes on the value of the corresponding eigenvalue. The maximum value of \(f\) within the constraint region is the maximum eigenvalue and its minimum value within the constraint region is the minimum eigenvalue.

A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero-valued is called positive semidefinite. Positive semidefinite matrices are interesting because they guarantee that \(\forall \pmb x, \pmb{x}^T \pmb{Ax} \geq 0\). Positive definite matrices additionally guarantee that \(\pmb{x}^T \pmb{Ax} = 0 \Rightarrow \pmb x = 0\).

Singular Value Decomposition

The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.

The singular value decomposition is similar to eigendecomposition, except this time we will write \(\pmb A\) as a product of three matrices:

\begin{equation*} \pmb A = \pmb{U DV}^T \end{equation*}

Suppose that \(\pmb A\) is an m x n matrix. Then \(\pmb U\) is defined to be an m x m matrix, \(\pmb D\) to be an m x n matrix, and \(\pmb V\) to be an n x n matrix.

Each of these matrices is defined to have a special structure. The matrices \(\pmb U\) and \(\pmb V\) are both defined to be orthogonal matrices. The matrix \(\pmb D\) is defined to bea diagonal matrix. Note that \(\pmb D\) is not necessarily square.

The elements along the diagonal of \(\pmb D\) are known as the singular values of the matrix \(\pmb A\) . The columns of \(\pmb U\) are known as the left-singular vectors. The columns of \(\pmb V\) are known as as the right-singular vectors.

We can actually interpret the singular value decomposition of \(\pmb A\) in terms of the eigendecomposition of functions of \(\pmb A\). The left-singular vectors of \(\pmb A\) are theeigenvectors of \(\pmb{AA}^T\). The right-singular vectors of \(\pmb A\) are the eigenvectors of \(\pmb{A}^T\pmb{A}\) . The non-zero singular values of \(\pmb A\) are the square roots of the eigenvalues of \(\pmb{A}^T\pmb{A}\) . The same is true for \(\pmb{AA}^T\) .

Moore-Penrose pseudoinverse

Suppose we want to make a left-inverse \(B\) of a matrix \(A\), so that we can solve a linear equation \(Ax = y\) by left-multiplying each side to obtain \(x = By\). Depending on the structure of the problem, it may not be possible to design a unique mapping from \(A\) to \(B\).

If \(A\) is taller than it is wide, then it is possible for this equation to have no solution. If \(A\) is wider than it is tall, then there could be multiple possible solutions. The pseudoinverse of \(A\) is defined as a matrix

\begin{equation*} A^{+} = \lim_{\alpha\to 0} (A^TA + \alpha I)^{-1}A^T \end{equation*}

Practical algorithms for computing the pseudoinverse are not based on this definition, but rather the formula

\begin{equation*} A^{+} = VD^{+}U^T \end{equation*}

where \(U\), \(D\) and \(V\) are the singular value decomposition of \(A\), and the pseudoinverse \(D^{+}\) of a diagonal matrix \(D\) is obtained by taking the reciprocal of its non-zero elements then taking the transpose of the resulting matrix.

When \(A\) has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution \(x=A^{+}y\) with minimal Euclidean norm \(||x||_2\) among all possible solutions.

When \(A\) has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the \(x\) for which \(Ax\) is as close as possible to \(y\) in terms of Euclidean norm \(||Ax - y||_2\).

The Trace Operator

Trace operator gives the sum of all of the diagonal entries of a matrix:

\begin{equation*} Tr(A) = \sum_i A_{i, i} \end{equation*}

For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:

\begin{equation*} ||A||_F=\sqrt{(Tr(AA^T))} \end{equation*}

Also: \(Tr(A) = Tr(A^T)\), and \(Tr(ABC) = Tr(CAB) = Tr(BCA)\).

The Determinant

The determinant of a square matrix, denoted \(det(A)\), is a function mapping matrices to real scalars. The determinant is equal to the product of all the eigenvalues of the matrix. The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume. If the determinant is 1, then the transformation preserves volume.