James Gregson's Website

Nov 15, 2018

Reordering Matrix Products

Here are a few operations for dealing with objective functions of matrix-valued variables:

Starting Definitions

Let \(A\) be an \(N\times M\) matrix with \(A_{ij}\) denoting the entry at row \(i\) and column \(j\). The \(i\)'th row of \(A\) is then \(A_{i*}\) and the \(j\)'th column is \(A_{*j}\).

Row and Column Vectorization

Define the column vectorization operator, which returns a single column vector containing the columns of \(A\) stacked:

\begin{equation*} \newcommand{\cvec}{{\bf\mbox{cvec}}} \cvec(A) = \begin{bmatrix} A_{*1} \\ A_{*2} \\ \vdots \\ A_{*M} \end{bmatrix} \end{equation*}

Likewise, the row vectorization operator returns a single column vector containing the (transposes of) rows of \(A\) stacked:

\begin{equation*} \newcommand{\rvec}{{\bf\mbox{rvec}}} \rvec(A) = \cvec(A^T) = \begin{bmatrix} A_{1*}^T \\ A_{2*}^T \\ \vdots \\ A_{N*}^T \end{bmatrix} \end{equation*}

In numpy, matrices are stored in row-major order, so the Python definitions of these operations are inverted, as below:

import numpy as np

def cvec( A ):
    """Stack columns of A into single column vector"""
    return rvec(A.T)

def rvec( A ):
    """Stack rows of A into a single column vector"""
    # numpy is row-major, so ravel() already concatenates the rows
    return A.ravel()

Similarly, the inverses of \(\cvec(A)\) and \(\rvec(A)\) unpack the vectorized values back into a matrix of the original shape:

\begin{align*} \newcommand{\cmat}{{\bf\mbox{cmat}}} \newcommand{\rmat}{{\bf\mbox{rmat}}} \cmat\left( \cvec(A), M \right) &=& A \\ \rmat\left( \rvec(A), N \right) &=& A \end{align*}

with corresponding Python definitions:

def rmat( v, nr ):
    """Reshape vector v into matrix with nr rows"""
    return v.reshape((nr,-1))

def cmat( v, nc ):
    """Reshape vector v into matrix with nc columns"""
    # unpack into the transposed (row-major) shape, then transpose back
    return v.reshape((nc,-1)).T
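
As a quick sanity check, the round trip through vectorization and unpacking should recover the original matrix; a minimal sketch assuming the definitions above:

A = np.arange(12.0).reshape(3,4)   # N=3 rows, M=4 columns

# unpacking a vectorized matrix recovers the original
assert np.allclose( rmat( rvec(A), 3 ), A )
assert np.allclose( cmat( cvec(A), 4 ), A )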

Finally, define two additional operators. The first of these is the spread operator, which takes the Kronecker product between an identity matrix and the input matrix, resulting in copies of the input matrix along the diagonal:

\begin{equation*} \newcommand{\spread}[2]{{\bf\mbox{spread}}_{#2}\left(#1\right)} \spread{A}{r} = I_{r\times r} \otimes A = \begin{bmatrix} A & 0 & \dots & 0 \\ 0 & A & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & A \end{bmatrix} \end{equation*}

The second is the split operator, which reverses the order of the arguments to the Kronecker product, basically replacing each entry \(A_{ij}\) with \(A_{ij} I_{r\times r}\):

\begin{equation*} \newcommand{\split}[2]{{\bf\mbox{split}}_{#2}\left(#1\right)} \split{A}{r} = A \otimes I_{r\times r} = \begin{bmatrix} A_{11} I_{r\times r} & A_{12} I_{r\times r} & \dots & A_{1M} I_{r\times r} \\ A_{21} I_{r\times r} & A_{22} I_{r\times r} & \dots & A_{2M} I_{r\times r} \\ \vdots & \vdots & \ddots & \vdots \\ A_{N1} I_{r\times r} & A_{N2} I_{r\times r} & \dots & A_{NM} I_{r\times r} \end{bmatrix} \end{equation*}

Python implementations of these are one-liners that just call numpy's existing Kronecker product:

def spread( M, n ):
    """Make n copies of M along the matrix diagonal"""
    return np.kron( np.eye(n), M )


def split( M, n ):
    """Replace each entry of M with a n-by-n identity matrix scaled by the entry value"""
    return np.kron( M, np.eye(n) )

With these, it is possible to rewrite matrix products as matrix-vector products acting on vectorized matrices. There are also some fun properties that come from the Kronecker product (the inverse identities assume \(M\) is square and invertible):

\begin{align*} \spread{M}{n}^T &=& \spread{M^T}{n} \\ \spread{M}{n}^{-1} &=& \spread{M^{-1}}{n} \\ \split{M}{n}^T &=& \split{M^T}{n} \\ \split{M}{n}^{-1} &=& \split{M^{-1}}{n} \end{align*}
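
These properties are easy to check numerically with the definitions above; a rough sketch using a random invertible matrix (sizes chosen arbitrarily):

M = np.random.rand(4,4) + 4*np.eye(4)   # keep it comfortably invertible
n = 3

assert np.allclose( spread(M,n).T,                spread(M.T,n) )
assert np.allclose( np.linalg.inv( spread(M,n) ), spread(np.linalg.inv(M),n) )
assert np.allclose( split(M,n).T,                 split(M.T,n) )
assert np.allclose( np.linalg.inv( split(M,n) ),  split(np.linalg.inv(M),n) )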

Matrix Products

Let \(X\) be an \(N\times K\) matrix and \(Y\) a \(K\times M\) matrix; their product \(W = XY\) is then an \(N\times M\) matrix and the following are true:

\begin{align*} \rvec(W) &=& \split{X}{M} \rvec(Y) \\ \cvec(W) &=& \spread{X}{M} \cvec(Y) \\ \rvec(W) &=& \spread{Y^T}{N} \rvec(X) \\ \cvec(W) &=& \split{Y^T}{N} \cvec(X) \end{align*}
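
These identities can be verified numerically with the earlier definitions; a quick sketch with arbitrary sizes and random data:

N, K, M = 3, 4, 5
X = np.random.rand(N,K)
Y = np.random.rand(K,M)
W = X @ Y

assert np.allclose( rvec(W), split(X,M)    @ rvec(Y) )
assert np.allclose( cvec(W), spread(X,M)   @ cvec(Y) )
assert np.allclose( rvec(W), spread(Y.T,N) @ rvec(X) )
assert np.allclose( cvec(W), split(Y.T,N)  @ cvec(X) )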

What these let you do is unbury any (single) term that's in the middle of some horrendous expression. For example, say there's an expression \(W = AXB\) with \(A \in \mathcal{R}^{N\times R}\), \(X \in \mathcal{R}^{R\times S}\) and \(B \in \mathcal{R}^{S \times M}\) and you want to isolate \(X\) as a column vector. Then you have:

\begin{align*} \cvec(W) &=& \spread{A}{M} \split{B^T}{R} \cvec(X) \\ &=& \left( I_{M\times M} \otimes A \right) \left( B^T \otimes I_{R\times R} \right) \cvec(X) \\ &=& \left( I_{M\times M} B^T \right) \otimes \left( A I_{R\times R} \right) \cvec(X) \\ &=& \left( B^T \otimes A \right) \cvec(X) \end{align*}

The step in the third line uses the Kronecker product identity:

\begin{equation*} \left( A \otimes B \right) \left( C \otimes D \right ) = \left( A C \right) \otimes \left( B D \right) \end{equation*}

The same steps can also be done if you want row-major storage of \(X\):

\begin{align*} \rvec(W) &=& \split{A}{M} \spread{B^T}{R} \rvec(X) \\ &=& \left( A \otimes I_{M\times M} \right) \left( I_{R\times R} \otimes B^T \right) \rvec(X) \\ &=& \left( A \otimes B^T \right) \rvec(X) \end{align*}
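
Both forms are easy to confirm numerically; a minimal sketch using random matrices of the shapes above:

N, R, S, M = 3, 4, 5, 6
A = np.random.rand(N,R)
X = np.random.rand(R,S)
B = np.random.rand(S,M)
W = A @ X @ B

assert np.allclose( cvec(W), np.kron( B.T, A   ) @ cvec(X) )
assert np.allclose( rvec(W), np.kron( A,   B.T ) @ rvec(X) )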

In these cases the dimensions work out correctly, but you don't need to worry too much about them since the dimensions of \(X\) are constrained by the column count of \(A\) and the row count of \(B\). This only applies when operating with them symbolically, of course.

Frobenius Norm Problems

With this information, it becomes pretty easy to deal with Frobenius norm problems like:

\begin{equation*} X = \mbox{argmin} \frac{1}{2} \| A X B - Y \|_F^2 \end{equation*}

I have found these frustrating in the past since I invariably end up in indexing hell. But it becomes much easier with the identities above. Applying them isolates \(X\) on the right-hand side of the expression and converts the Frobenius norm to an \(L_2\) norm. Making the substitutions \(\tilde{x} = \cvec(X)\) and \(\tilde{y} = \cvec(Y)\),

\begin{align*} \tilde{x}_* &=& \mbox{argmin} \frac{1}{2} \| \left( B^T \otimes A \right) \tilde{x} - \tilde{y} \|_2^2 \\ &=& \mbox{argmin} \frac{1}{2} \tilde{x}^T \left( B \otimes A^T \right) \left( B^T \otimes A \right) \tilde{x} - \tilde{x}^T \left( B \otimes A^T \right) \tilde{y} + \mbox{constants} \\ &=& \mbox{argmin} \frac{1}{2} \tilde{x}^T \left( B B^T \otimes A^T A \right) \tilde{x} - \tilde{x}^T \left( B \otimes A^T \right) \tilde{y} \end{align*}

This now looks like a standard least-squares problem, albeit one with this Kronecker product mashed into it. Setting the gradient to zero gives the solution as:

\begin{align*} \tilde{x}_* &=& \left( B B^T \otimes A^T A \right)^{-1} \left( B \otimes A^T \right ) \tilde{y} \\ &=& \left( (B B^T)^{-1} \otimes (A^T A)^{-1} \right) \left( B \otimes A^T \right) \tilde{y} \\ &=& \left( (B B^T)^{-1} B \otimes (A^T A)^{-1} A^T \right) \tilde{y} \end{align*}

This result has a similar form to \(\cvec(AXB) = \left( B^T \otimes A\right)\cvec(X)\), which can be used to get back to the matrix form of the expression:

\begin{align*} X &=& (A^T A)^{-1} A^T Y \left( (B B^T)^{-1} B\right)^T \\ &=& (A^T A)^{-1} A^T Y B^T (B B^T)^{-1} \end{align*}
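
Assuming \(A\) has full column rank and \(B\) has full row rank, the Kronecker-form solution and the matrix-form solution can be compared directly; a sketch with arbitrary sizes and random data:

N, R, S, M = 8, 3, 4, 9
A = np.random.rand(N,R)
B = np.random.rand(S,M)
Y = np.random.rand(N,M)

# vectorized solution: solve the Kronecker-form normal equations for cvec(X)
x = np.linalg.solve( np.kron( B@B.T, A.T@A ), np.kron( B, A.T ) @ cvec(Y) )
X_vec = cmat( x, S )

# matrix form of the same solution
X_mat = np.linalg.inv(A.T@A) @ A.T @ Y @ B.T @ np.linalg.inv(B@B.T)

assert np.allclose( X_vec, X_mat )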

It's also possible to come at this from a different direction:

\begin{align*} X &=& \mbox{argmin} \frac{1}{2} \| A X B - Y \|_F^2 \\ &=& \mbox{argmin} \frac{1}{2} \mbox{Tr}\left( (AXB - Y)(AXB - Y)^T \right) \end{align*}

The gradient of this is given in the Matrix Cookbook.

\begin{equation*} \frac{\partial}{\partial X} \frac{1}{2} \mbox{Tr}\left( (AXB - Y)(AXB - Y)^T \right) = A^T \left( A X B - Y \right) B^T \end{equation*}
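
A quick central-difference check of this gradient (a sketch with arbitrary shapes, independent of the sizes used above):

A = np.random.rand(5,3)
X = np.random.rand(3,4)
B = np.random.rand(4,6)
Y = np.random.rand(5,6)

def f( X ):
    """Objective 0.5*||AXB - Y||_F^2"""
    return 0.5*np.sum( (A@X@B - Y)**2 )

# analytic gradient from the Matrix Cookbook identity above
G = A.T @ (A@X@B - Y) @ B.T

# central differences over each entry of X
eps = 1e-6
G_fd = np.zeros_like( X )
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like( X )
        E[i,j] = eps
        G_fd[i,j] = ( f(X+E) - f(X-E) )/(2*eps)

assert np.allclose( G, G_fd, atol=1e-5 )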

Setting the gradient of this to zero and isolating \(X\) gives the following:

\begin{align*} A^T A X B B^T - A^T Y B^T = 0 \\ X = (A^T A)^{-1} A^T Y B^T (B B^T)^{-1} \end{align*}

which matches the previous result.