James Gregson's Website

Nov 10, 2018

Gradients are Row-Vectors

\begin{equation*} \newcommand{pd}[2]{\frac{\partial #1}{\partial #2}} \end{equation*}

Let \(x = \left[ x_1, x_2, x_3 \right]^T\) and let \(y_1 = f(x)\) be a scalar function of it. Then \(\nabla y_1\) is:

\begin{equation*} \nabla y_1 = \left[ \pd{f}{x_1}, \pd{f}{x_2}, \pd{f}{x_3} \right] \end{equation*}

i.e., a row-vector. This makes it compatible with the corresponding definition of the Jacobian: when

\begin{equation*} y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} f_1(x) \\ f_2(x) \end{bmatrix} \end{equation*}

then the Jacobian of \(y\) is:

\begin{equation*} J_y = \begin{bmatrix} \pd{f_1}{x_1} & \pd{f_1}{x_2} & \pd{f_1}{x_3} \\ \pd{f_2}{x_1} & \pd{f_2}{x_2} & \pd{f_2}{x_3} \end{bmatrix} = \begin{bmatrix} \nabla f_1 \\ \nabla f_2 \end{bmatrix} \end{equation*}

This also allows linearization of the scalar or vector function to be written consistently. Here \(\delta x\) is a small perturbation of \(x\):

\begin{align*} y_1(x+\delta x) \approx y_1(x) + \nabla y_1 \delta x \\ y (x+\delta x) \approx y(x) + J_y \delta x \end{align*}
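As a quick numerical sanity check of these shapes, here is a sketch using an arbitrary made-up \(f:\mathcal{R}^3\rightarrow\mathcal{R}^2\) (not from anything above) and forward differences:

```python
import numpy as np

# A made-up vector function f: R^3 -> R^2, purely for illustration.
def f(x):
    return np.array([x[0] * x[1] + x[2] ** 2,
                     np.sin(x[0]) + x[1] * x[2]])

def jacobian_fd(f, x, eps=1e-6):
    """Forward-difference Jacobian: row i is the (row-vector) gradient of output i."""
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (f(x + dx) - y0) / eps
    return J

x = np.array([0.3, -1.2, 0.7])
dx = 1e-4 * np.array([1.0, -2.0, 0.5])
J = jacobian_fd(f, x)          # shape (2, 3): gradients stacked as rows

print(f(x) + J @ dx)           # linearized prediction y(x) + J_y dx
print(f(x + dx))               # true perturbed value -- agrees to first order
```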

That's great, but where this really becomes useful is in differentiating more complex expressions. For example, in the context of optimization or regression, take \(E = \frac{1}{2} \| y \|^2\). \(E\) is a scalar function of a vector function of a vector and it's a pain to differentiate symbolically with respect to the parameters \(x\).

What helps here is this:

\begin{equation*} \pd{E}{x} = \pd{E}{y} \pd{y}{x} = y^T J_y \end{equation*}

The transpose on the \(y\) factor is included because it fits the definition that the gradient of a scalar function taking a vector argument is a row-vector. The above expression expands to:

\begin{equation*} \pd{E}{x} = \left[ y_1, y_2 \right] \begin{bmatrix} \pd{f_1}{x_1} & \pd{f_1}{x_2} & \pd{f_1}{x_3} \\ \pd{f_2}{x_1} & \pd{f_2}{x_2} & \pd{f_2}{x_3} \end{bmatrix} = y_1 \nabla y_1 + y_2 \nabla y_2 \end{equation*}

which shows that the terms involving \(y_1\) and \(y_2\) are correctly decoupled and only combine through the summation implied by the 2-norm.
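The same decoupling can be checked numerically. The sketch below uses a made-up \(f\) (the same hypothetical one as before) and forward differences for both the Jacobian and the direct gradient of \(E\):

```python
import numpy as np

# Hypothetical f: R^3 -> R^2 and E(x) = 0.5*||f(x)||^2, for illustration only.
f = lambda x: np.array([x[0] * x[1] + x[2] ** 2, np.sin(x[0]) + x[1] * x[2]])
E = lambda x: 0.5 * np.dot(f(x), f(x))

def fd_grad(g, x, eps=1e-6):
    """Forward-difference gradient of a scalar function, one entry per x_i."""
    return np.array([(g(x + eps * e) - g(x)) / eps for e in np.eye(x.size)])

def fd_jac(g, x, eps=1e-6):
    """Forward-difference Jacobian: gradients of each output stacked as rows."""
    return np.column_stack([(g(x + eps * e) - g(x)) / eps for e in np.eye(x.size)])

x = np.array([0.3, -1.2, 0.7])
y, J = f(x), fd_jac(f, x)

print(y @ J)           # y^T J_y
print(fd_grad(E, x))   # direct finite-difference gradient of E -- same values
```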

What about when \(y = A x - b\)? Well, it's pretty easy to substitute in:

\begin{align*} \pd{E}{y} = y^T \\ \pd{y}{x} = A \\ \pd{E}{x} = (A x - b)^T A \end{align*}

Wait! What's up with that? Isn't it supposed to be \(A^T(A x -b)\)? Well, yes, but that's when gradients are column vectors. Which they're not here. And if you transpose the row vector result you get the expected column vector result.

\begin{equation*} \pd{E}{x}^T = \left( (A x - b)^T A \right)^T = A^T (A x -b) \end{equation*}
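This is easy to confirm numerically with random \(A\), \(b\) and \(x\) (a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
x = rng.standard_normal(3)

E = lambda x: 0.5 * np.sum((A @ x - b) ** 2)

# Forward-difference gradient, one entry per component of x.
eps = 1e-6
g_fd = np.array([(E(x + eps * e) - E(x)) / eps for e in np.eye(3)])

print((A @ x - b) @ A)       # row-oriented gradient  (A x - b)^T A
print(A.T @ (A @ x - b))     # its transpose, the familiar column-oriented form
print(g_fd)                  # finite differences agree with both sets of values
```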

Playing devil's advocate

So what happens if you decide to make gradients column vectors? Well, for starters, you have to redefine the del operator from \(\nabla = \left[ \pd{}{x_1}, \pd{}{x_2}, \dots, \pd{}{x_N} \right]\) to:

\begin{equation*} \nabla = \begin{bmatrix} \pd{}{x_1} \\ \pd{}{x_2} \\ \vdots \\ \pd{}{x_N} \end{bmatrix} \end{equation*}

Redefining fundamental mathematical objects should be the first clue that this is wrong, but pressing on with the same definitions, you get:

\begin{equation*} \nabla y_1 = \begin{bmatrix} \pd{y_1}{x_1} \\ \pd{y_1}{x_2} \\ \pd{y_1}{x_3} \end{bmatrix} \end{equation*}

Stacking these gradients column-wise gives something resembling the Jacobian but which is actually its transpose:

\begin{equation*} \pd{y}{x} = J_y^T = \begin{bmatrix} \pd{y_1}{x_1} & \pd{y_2}{x_1} \\ \pd{y_1}{x_2} & \pd{y_2}{x_2} \\ \pd{y_1}{x_3} & \pd{y_2}{x_3} \end{bmatrix} \end{equation*}

The chain rule then gives:

\begin{equation*} \pd{E}{x} = \pd{E}{y} \pd{y}{x} \end{equation*}

which leads to total garbage since the array dimensions are not compatible for matrix multiplication anymore.

\begin{equation*} \pd{E}{x} \neq y J^T = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \begin{bmatrix} \pd{y_1}{x_1} & \pd{y_2}{x_1} \\ \pd{y_1}{x_2} & \pd{y_2}{x_2} \\ \pd{y_1}{x_3} & \pd{y_2}{x_3} \end{bmatrix} \hspace{0.25cm}{\bf\mbox{Incompatible!}} \end{equation*}

But this can be fixed by changing the order of the chain rule from \(\pd{E}{x} = \pd{E}{y} \pd{y}{x}\) to \(\pd{E}{x} = \pd{y}{x} \pd{E}{y}\) to get:

\begin{equation*} \pd{E}{x} = J^T y \end{equation*}

which for the example of \(y = A x - b\) gives the expected column-oriented gradient \(\nabla E = A^T (A x - b)\). The difference here is that matrix composition left-multiplies new transformations onto an existing one, while the chain rule is typically written with right multiplication. This is really the only difference between column-oriented and row-oriented gradients, other than the discrepancy between the definitions of the gradient and the Jacobian matrix. When the gradient and Jacobian are defined correctly (i.e. row-oriented) this does not occur.
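This shape bookkeeping is easy to see with NumPy's 2-D arrays. The sketch below uses random data and emulates the column-gradient convention by keeping \(y\) as an explicit column:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1))

y = A @ x - b          # 2x1 column
J = A                  # 2x3 Jacobian, so J.T is the 3x2 column-oriented "dy/dx"

try:
    y @ J.T            # (2,1) @ (3,2): the naive chain-rule order fails
except ValueError as err:
    print("incompatible:", err)

print(J.T @ y)              # re-ordered chain rule: 3x1 column
print(A.T @ (A @ x - b))    # equals A^T (A x - b)
```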

What this ends up meaning is that, in order for the results to work out correctly with column-oriented gradients, you have to carry the Jacobian transpose throughout your work and remember to re-order the chain rule; otherwise you need to forensically reconstruct what the dimensions should be from the scalar terms of your objective function (in optimization anyway). I feel this is too high a price to pay.

On the other hand, defining points in one domain as column-vectors and gradients as row-vectors means that expressions involving both need some added boilerplate, e.g.:

\begin{equation*} v = x - \delta \nabla y \hspace{0.25cm}{\bf\mbox{Incompatible!}} \end{equation*}

since, with \(\delta\) a scalar, \(x\) is a column-vector while \(\nabla y\) is a row-vector, so the dimensions are incompatible. Instead the update above would need to be expressed as:

\begin{equation*} v = x - \delta (\nabla y)^T \end{equation*}

This is awkward but tends not to be a big problem since many programming languages have good array broadcasting that allows the transpose to be omitted when programming.
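In NumPy, for instance, the distinction disappears entirely if points and gradients are stored as 1-D arrays, while explicit 2-D row/column vectors still need the transpose. A sketch with made-up values, including the silent broadcasting pitfall in the 2-D case:

```python
import numpy as np

x = np.array([[1.0], [2.0], [3.0]])      # explicit 3x1 column vector
grad = np.array([[0.1, 0.2, 0.3]])       # explicit 1x3 row-vector gradient
delta = 0.5

print((x - delta * grad).shape)          # (3, 3): silent broadcasting, almost certainly a bug
print((x - delta * grad.T).shape)        # (3, 1): the transpose fixes it

# With 1-D arrays the row/column distinction (and the transpose) goes away.
x1, g1 = x.ravel(), grad.ravel()
print((x1 - delta * g1).shape)           # (3,)
```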

Some Random Example

Let:

\begin{equation*} A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix} \end{equation*}

and:

\begin{equation*} B = \begin{bmatrix} b_{11} & b_{12} & b_{13} & b_{14} \\ b_{21} & b_{22} & b_{23} & b_{24} \end{bmatrix} \end{equation*}

and let \(A\) & \(B\) be functions of \(\theta \in \mathcal{R}^5\). Now try to minimize half the squared Frobenius norm of \(Y = AB\):

\begin{equation*} F(\theta) = \frac{1}{2} \| Y \|_F^2 = \frac{1}{2} \mbox{Tr}\left(Y Y^T\right) \end{equation*}
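(As a quick numerical reminder of the trace identity, with a random \(Y\):)

```python
import numpy as np

Y = np.random.default_rng(2).standard_normal((3, 4))
print(0.5 * np.sum(Y ** 2))       # 0.5 * ||Y||_F^2
print(0.5 * np.trace(Y @ Y.T))    # 0.5 * Tr(Y Y^T) -- identical
```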

\(F(\theta)\) is a scalar and \(\theta\) is a vector, so the gradient should be a \(1\times5\) row-vector in the end. Starting with the chain rule, an issue is evident:

\begin{equation*} \nabla F(\theta) = Y \pd{Y}{\theta} \end{equation*}

\(Y\) is the wrong shape, \(3\times4\), and \(\pd{Y}{\theta}\) is a \(3 \times 4 \times 5\) array whose final index \(k\) holds \(\pd{Y}{\theta_k}\). In order to make this work, we can concatenate the rows \(y_{i*}\) of \(Y\) to get a \(1\times 12\) row-vector \(\tilde{Y}\), i.e.

\begin{equation*} \tilde{Y} = \begin{bmatrix} y_{1*} & y_{2*} & y_{3*} \end{bmatrix} = \mbox{vec}(Y^T)^T \end{equation*}

where \(\mbox{vec}(P)\) stacks the columns of \(P\) into a column vector. We can likewise stack the columns \(y_{*j}\) of \(Y\) to get a \(12\times1\) column vector \(\hat{Y}=\mbox{vec}(Y)\). The derivative of this with respect to \(\theta\) is just the Jacobian of each column of \(Y\), stacked row-wise:

\begin{equation*} \pd{\hat{Y}}{\theta} = \begin{bmatrix} \pd{y_{*1}}{\theta} \\ \pd{y_{*2}}{\theta} \\ \pd{y_{*3}}{\theta} \\ \pd{y_{*4}}{\theta} \end{bmatrix} \end{equation*}

By doing this, the sum over all matrix entries implied by the Frobenius norm is absorbed into a single inner product. Both factors must use the same ordering of entries, so the gradient is \(\hat{Y}^T \pd{\hat{Y}}{\theta}\) with \(\hat{Y}^T \in \mathcal{R}^{1\times12}\) and \(\pd{\hat{Y}}{\theta} \in \mathcal{R}^{12\times5}\) (equivalently, \(\tilde{Y}\) paired with \(\pd{\mbox{vec}(Y^T)}{\theta}\)). This gives the correct dimensions for the gradient, \(1\times 5\). Note that everything works pretty cleanly by adopting the row-vector form for gradients.
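Here is a numerical sketch of this bookkeeping, with \(A(\theta)\) and \(B(\theta)\) taken to be arbitrary made-up smooth functions of five parameters (not from anything above) and all derivatives approximated by finite differences:

```python
import numpy as np

# Hypothetical parameterizations A(theta) in R^{3x2} and B(theta) in R^{2x4},
# chosen only so that Y = A B depends smoothly on all five parameters.
def A_of(t):
    return np.array([[t[0], t[1]],
                     [t[2], np.sin(t[3])],
                     [t[4], t[0] * t[1]]])

def B_of(t):
    return np.array([[t[0], 1.0, t[2], t[3]],
                     [t[4], t[1] ** 2, 1.0, t[0] + t[2]]])

Y_of = lambda t: A_of(t) @ B_of(t)
F_of = lambda t: 0.5 * np.sum(Y_of(t) ** 2)

theta = np.array([0.3, -1.2, 0.7, 0.4, -0.6])
eps = 1e-6

# d vec(Y)/d theta: 12x5, column k is vec(Y) differentiated w.r.t. theta_k.
vecY = lambda t: Y_of(t).flatten(order='F')          # stack columns
dvecY = np.column_stack([(vecY(theta + eps * e) - vecY(theta)) / eps
                         for e in np.eye(5)])

grad_chain = vecY(theta) @ dvecY                      # vec(Y)^T d vec(Y)/d theta, 1x5
grad_fd = np.array([(F_of(theta + eps * e) - F_of(theta)) / eps
                    for e in np.eye(5)])

print(grad_chain)
print(grad_fd)    # agrees to finite-difference accuracy
```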

We still need to compute the \(\pd{y_{*i}}{\theta}\) terms. Starting from the definition of \(Y = AB\) it's pretty easy to see that \(y_{*i} = A b_{*i}\). We want \(\pd{y_{*i}}{\theta} \in \mathcal{R}^{3\times5}\), so apply the product rule:

\begin{equation*} \pd{y_{*i}}{\theta} = A \pd{b_{*i}}{\theta} + \pd{A}{\theta} b_{*i} \end{equation*}

The first term is no problem: it is simply a transform (\(A\)) applied to the Jacobian of \(b_{*i}\), and the shapes work out correctly. The second term is more problematic: \(\pd{A}{\theta}\) is a \(3\times 2 \times 5\) array while \(b_{*i}\) is \(2 \times 1\). One way to address this is to stack the rows of \(A\) into a column vector using the vectorization operator and splay \(b_{*i}\) using the Kronecker product:

\begin{equation*} A b_{*i} = \begin{bmatrix} b_{*i}^T & {\bf 0} & {\bf 0} \\ {\bf 0} & b_{*i}^T & {\bf 0} \\ {\bf 0} & {\bf 0} & b_{*i}^T \end{bmatrix} \begin{bmatrix} a_{1*}^T \\ a_{2*}^T \\ a_{3*}^T \end{bmatrix} = \left( {\bf I_{3\times3}} \otimes b_{*i}^T \right) \mbox{vec}(A^T) \end{equation*}
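This identity is easy to verify numerically. The sketch below uses random \(A\) and \(b\) and relies on \(\mbox{vec}(A^T)\) being \(A\) flattened in row-major order:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

vec_AT = A.flatten(order='C')            # vec(A^T): rows of A stacked into a vector
lhs = A @ b                              # ordinary matrix-vector product, length 3
rhs = np.kron(np.eye(3), b) @ vec_AT     # (I_3 kron b^T) vec(A^T); b acts as a row here

print(lhs)
print(rhs)    # identical
```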