#### Nov 04, 2019

These are collections of functions that I end up re-implementing frequently.

from typing import *
import numpy as np
def closest_rotation( M: np.ndarray ):
U,s,Vt = np.linalg.svd( M[:3,:3] )
if np.linalg.det(U@Vt) < 0:
Vt[2] *= -1.0
return U@Vt
def mat2quat( M: np.ndarray ):
if np.abs(np.linalg.det(M[:3,:3])-1.0) > 1e-5:
raise ValueError('Matrix determinant is not 1')
if np.abs( np.linalg.norm( M[:3,:3].T@M[:3,:3] - np.eye(3)) ) > 1e-5:
raise ValueError('Matrix is not orthogonal')
w = np.sqrt( 1.0 + M[0,0]+M[1,1]+M[2,2])/2.0
x = (M[2,1]-M[1,2])/(4*w)
y = (M[0,2]-M[2,0])/(4*w)
z = (M[1,0]-M[0,1])/(4*w)
return np.array((x,y,z,w),float)
def quat2mat( q: np.ndarray ):
qx,qy,qz,qw = q/np.linalg.norm(q)
return np.array((
(1 - 2*qy*qy - 2*qz*qz, 2*qx*qy - 2*qz*qw, 2*qx*qz + 2*qy*qw),
( 2*qx*qy + 2*qz*qw, 1 - 2*qx*qx - 2*qz*qz, 2*qy*qz - 2*qx*qw),
( 2*qx*qz - 2*qy*qw, 2*qy*qz + 2*qx*qw, 1 - 2*qx*qx - 2*qy*qy),
))
if __name__ == '__main__':
for i in range( 1000 ):
q = np.random.standard_normal(4)
q /= np.linalg.norm(q)
A = quat2mat( q )
s = mat2quat( A )
if np.linalg.norm( s-q ) > 1e-6 and np.linalg.norm( s+q ) > 1e-6:
raise ValueError('uh oh.')
R = closest_rotation( A+np.random.standard_normal((3,3))*1e-6 )
if np.linalg.norm(A-R) > 1e-5:
print( np.linalg.norm( A-R ) )
raise ValueError('uh oh.')

#### Nov 03, 2019

Python is a pretty nice language for prototyping and (with numpy and scipy) can be quite fast for numerics but is not great for algorithms with frequent tight loops. One place I run into this frequently is in graphics, particularly applications involving meshes.

Not much can be done if you need to frequently modify mesh structure. Luckily, many algorithms operate on meshes with fixed structure. In this case, expressing algorithms in terms of block operations can be much faster. Actually implementing methods in this way can be tricky though.

To gain more experience in this, I decided to try implementing the As-Rigid-As-Possible Surface Modeling mesh deformation algorithm in Python. This involves solving moderately sized Laplace systems where the right-hand-side involves some non-trivial computation (per-vertex SVDs embedded within evaluation of the Laplace operator). For details on the algorithm, see the link.

As it turns out, the entire implementation ended up being just 99 lines and solves 1600 vertex problems in just under 20ms (more on this later). Full code is available.

## Building the Laplace operator

ARAP uses cotangent weights for the Laplace operator in order to be relatively independent of mesh grading. These weights are functions of the angles opposite the edges emanating from each vertex. Building this operator using loops is slow but can be sped up considerably by realizing that every triangle has exactly 3 vertices, 3 angles and 3 edges. Operations can be evaluated on triangles and accumulated to edges.

Here's the Python code for a cotangent Laplace operator:

from typing import *
import numpy as np
import scipy.sparse as sparse
import scipy.sparse.linalg as spla
def build_cotan_laplacian( points: np.ndarray, tris: np.ndarray ):
a,b,c = (tris[:,0],tris[:,1],tris[:,2])
A = np.take( points, a, axis=1 )
B = np.take( points, b, axis=1 )
C = np.take( points, c, axis=1 )
eab,ebc,eca = (B-A, C-B, A-C)
eab = eab/np.linalg.norm(eab,axis=0)[None,:]
ebc = ebc/np.linalg.norm(ebc,axis=0)[None,:]
eca = eca/np.linalg.norm(eca,axis=0)[None,:]
alpha = np.arccos( -np.sum(eca*eab,axis=0) )
beta = np.arccos( -np.sum(eab*ebc,axis=0) )
gamma = np.arccos( -np.sum(ebc*eca,axis=0) )
wab,wbc,wca = ( 1.0/np.tan(gamma), 1.0/np.tan(alpha), 1.0/np.tan(beta) )
rows = np.concatenate(( a, b, a, b, b, c, b, c, c, a, c, a ), axis=0 )
cols = np.concatenate(( a, b, b, a, b, c, c, b, c, a, a, c ), axis=0 )
vals = np.concatenate(( wab, wab,-wab,-wab, wbc, wbc,-wbc,-wbc, wca, wca,-wca, -wca), axis=0 )
L = sparse.coo_matrix((vals,(rows,cols)),shape=(points.shape[1],points.shape[1]), dtype=float).tocsc()
return L

Inputs are float \(3 \times N_{verts}\) and integer \(N_{tris} \times 3\) arrays for vertices and triangles respectively. The code works by making arrays containing vertex indices and positions for all triangles, then computes and normalizes edges and finally computes angles and cotangents. Once the cotangents are available, initialization arrays for a scipy.sparse.coo_matrix are constructed and the matrix built. The coo_matrix is intended for finite-element matrices so it accumulates rather than overwrites duplicate values. The code is definitely faster and most likely fewer lines than a loop based implementation.

## Evaluating the Right-Hand-side

The Laplace operator is just a warm-up exercise compared to the right-hand-side computation. This involves:

- For every vertex, collecting the edge-vectors in its one-ring in both original and deformed coordinates
- Computing the rotation matrices that optimally aligns these neighborhoods, weighted by the Laplacian edge weights, per-vertex. This involves a SVD per-vertex.
- Rotating the undeformed neighborhoods by the per-vertex rotation and summing the contributions.

In order to vectorize this, I constructed neighbor and weight arrays for the Laplace operator. These store the neighboring vertex indices and the corresponding (positive) weights from the Laplace operator. A catch is that vertices have different numbers of neighbors. To address this I use fixed sized arrays based on the maximum valence in the mesh. These are initialized with default values of the vertex ids and weights of zero so that unused vertices contribute nothing and do not affect matrix structure.

To compute the rotation matrices, I used some index magic to accumulate the weighted neighborhoods and rely on numpy's batch SVD operation to vectorize over all vertices. Finally, I used the same index magic to rotate the neighborhoods and accumulate the right-hand-side.

The code to accumulate the neighbor and weight arrays is:

def build_weights_and_adjacency( points: np.ndarray, tris: np.ndarray, L: Optional[sparse.csc_matrix]=None ):
L = L if L is not None else build_cotan_laplacian( points, tris )
n_pnts, n_nbrs = (points.shape[1], L.getnnz(axis=0).max()-1)
nbrs = np.ones((n_pnts,n_nbrs),dtype=int)*np.arange(n_pnts,dtype=int)[:,None]
wgts = np.zeros((n_pnts,n_nbrs),dtype=float)
for idx,col in enumerate(L):
msk = col.indices != idx
indices = col.indices[msk]
values = col.data[msk]
nbrs[idx,:len(indices)] = indices
wgts[idx,:len(indices)] = -values

Rather than break down all the other steps individually, here is the code for an ARAP class:

class ARAP:
def __init__( self, points: np.ndarray, tris: np.ndarray, anchors: List[int], anchor_weight: Optional[float]=10.0, L: Optional[sparse.csc_matrix]=None ):
self._pnts = points.copy()
self._tris = tris.copy()
self._nbrs, self._wgts, self._L = build_weights_and_adjacency( self._pnts, self._tris, L )
self._anchors = list(anchors)
self._anc_wgt = anchor_weight
E = sparse.dok_matrix((self.n_pnts,self.n_pnts),dtype=float)
for i in anchors:
E[i,i] = 1.0
E = E.tocsc()
self._solver = spla.factorized( ( self._L.T@self._L + self._anc_wgt*E.T@E).tocsc() )
@property
def n_pnts( self ):
return self._pnts.shape[1]
@property
def n_dims( self ):
return self._pnts.shape[0]
def __call__( self, anchors: Dict[int,Tuple[float,float,float]], num_iters: Optional[int]=4 ):
con_rhs = self._build_constraint_rhs(anchors)
R = np.array([np.eye(self.n_dims) for _ in range(self.n_pnts)])
def_points = self._solver( self._L.T@self._build_rhs(R) + self._anc_wgt*con_rhs )
for i in range(num_iters):
R = self._estimate_rotations( def_points.T )
def_points = self._solver( self._L.T@self._build_rhs(R) + self._anc_wgt*con_rhs )
return def_points.T
def _estimate_rotations( self, def_pnts: np.ndarray ):
tru_hood = (np.take( self._pnts, self._nbrs, axis=1 ).transpose((1,0,2)) - self._pnts.T[...,None])*self._wgts[:,None,:]
rot_hood = (np.take( def_pnts, self._nbrs, axis=1 ).transpose((1,0,2)) - def_pnts.T[...,None])
U,s,Vt = np.linalg.svd( rot_hood@tru_hood.transpose((0,2,1)) )
R = U@Vt
dets = np.linalg.det(R)
Vt[:,self.n_dims-1,:] *= dets[:,None]
R = U@Vt
return R
def _build_rhs( self, rotations: np.ndarray ):
R = (np.take( rotations, self._nbrs, axis=0 )+rotations[:,None])*0.5
tru_hood = (self._pnts.T[...,None]-np.take( self._pnts, self._nbrs, axis=1 ).transpose((1,0,2)))*self._wgts[:,None,:]
rhs = np.sum( (R@tru_hood.transpose((0,2,1))[...,None]).squeeze(), axis=1 )
return rhs
def _build_constraint_rhs( self, anchors: Dict[int,Tuple[float,float,float]] ):
f = np.zeros((self.n_pnts,self.n_dims),dtype=float)
f[self._anchors,:] = np.take( self._pnts, self._anchors, axis=1 ).T
for i,v in anchors.items():
if i not in self._anchors:
raise ValueError('Supplied anchor was not included in list provided at construction!')
f[i,:] = v
return f

The indexing was tricky to figure out. Error handling and input checking can obviously be improved. Something I'm not going into are the handling of anchor vertices and the overall quadratic forms that are being solved.

## So does it Work?

Yes. Here is a 2D equivalent to the bar example from the paper:

import time
import numpy as np
import matplotlib.pyplot as plt
import arap
def grid_mesh_2d( nx, ny, h ):
x,y = np.meshgrid( np.linspace(0.0,(nx-1)*h,nx), np.linspace(0.0,(ny-1)*h,ny))
idx = np.arange(nx*ny,dtype=int).reshape((ny,nx))
quads = np.column_stack(( idx[:-1,:-1].flat, idx[1:,:-1].flat, idx[1:,1:].flat, idx[:-1,1:].flat ))
tris = np.vstack((quads[:,(0,1,2)],quads[:,(0,2,3)]))
return np.row_stack((x.flat,y.flat)), tris, idx
nx,ny,h = (21,5,0.1)
pnts, tris, ix = grid_mesh_2d( nx,ny,h )
anchors = {}
for i in range(ny):
anchors[ix[i,0]] = ( 0.0, i*h)
anchors[ix[i,1]] = ( h, i*h)
anchors[ix[i,nx-2]] = (h*nx*0.5-h, h*nx*0.5+i*h)
anchors[ix[i,nx-1]] = ( h*nx*0.5, h*nx*0.5+i*h)
deformer = arap.ARAP( pnts, tris, anchors.keys(), anchor_weight=1000 )
start = time.time()
def_pnts = deformer( anchors, num_iters=2 )
end = time.time()
plt.triplot( pnts[0], pnts[1], tris, 'k-', label='original' )
plt.triplot( def_pnts[0], def_pnts[1], tris, 'r-', label='deformed' )
plt.legend()
plt.title('deformed in {:0.2f}ms'.format((end-start)*1000.0))
plt.show()

The result of this script is the following:

This is qualitatively quite close to the result for two iterations in Figure 3 of the ARAP paper. Note that the timings may be a bit unreliable for such a small problem size.

A corresponding result for 3D deformations is here (note that this corrects an earlier bug in rotation computations for 3D):

#### Nov 15, 2018

Here are a few operations for dealing with objective functions of matrix-valued variables:

## Starting Definitions

Let \(A\) be a \(N\times M\) matrix with entry \(A_{ij}\) being the entry at row \(i\) and column \(j\). The \(i\)'th row of \(A\) is then \(A_{i*}\) and the \(j\)'th column is \(A_{*j}\).

## Row and Column Vectorization

Define the column vectorization operator which returns a single column vector containing the columns of \(A\) stacked

\begin{equation*}
\newcommand{\cvec}{{\bf\mbox{cvec}}}
\cvec(A) = \begin{bmatrix}
A_{*1} \\
A_{*2} \\
\vdots \\
A_{*M}
\end{bmatrix}
\end{equation*}

likewise the row vectorization operator returns a single column vector containing the (transposes of) rows of \(A\) stacked:

\begin{equation*}
\newcommand{\rvec}{{\bf\mbox{rvec}}}
\rvec(A) = \cvec(A^T) = \begin{bmatrix}
A_{1*}^T \\
A_{2*}^T \\
\vdots \\
A_{N*}
\end{bmatrix}
\end{equation*}

In numpy, matrices are stored in row-major order, so the Python definition of these operaions is inverted as below:

def cvec( A ):
"""Stack columns of A into single column vector"""
return rvec(A.T)
def rvec( A ):
"""Stack rows of A into a single column vector"""
return A.ravel()

Similarly, the inverses of \(\cvec(A)\) and \(\rvec(A)\) unpack the vectorized values back into a matrix of the original shape:

\begin{align*}
\newcommand{\cmat}{{\bf\mbox{cmat}}}
\newcommand{\rmat}{{\bf\mbox{rmat}}}
\cmat\left( \cvec(A), M \right) = A \\
\rmat\left( \rvec(A), N \right) = A \\
\end{align*}

with corresponding python definitions:

def rmat( v, nr ):
"""Reshape vector v into matrix with nr rows"""
return v.reshape((nr,-1))
def cmat( v, nc ):
"""Reshape vector v into matrix with nc columns"""
return v.reshape((cols,-1)).T

Finally, define two additional operators. The first of these is the spread operator, which takes the Kronecker product between an identity matrix and the input matrix, resulting in copies of the input matrix along the diagonal:

\begin{equation*}
\newcommand{\spread}[2]{{\bf\mbox{spread}}_{#2}\left(#1\right)}
\spread{A}{r} = I_{r\times r} \otimes A = \begin{bmatrix}
A & 0 & \dots & 0 \\
0 & A & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & A
\end{bmatrix}
\end{equation*}

The second of these is the split operator which reverses the order of the arguments to the Kronecker product, basically replacing each entry \(A_{ij}\) with \(A_{i,j} I_{r\times r}\):

\begin{equation*}
\newcommand{\split}[2]{{\bf\mbox{split}}_{#2}\left(#1\right)}
\split{A}{r} = A \otimes I_{r\times r} = \begin{bmatrix}
A_{11} I_{r\times r} & A_{12} I_{r\times r} & \dots & A_{1M} I_{r\times r} \\
A_{21} I_{r\times r} & A_{22} I_{r\times r} & \dots & A_{2M} I_{r\times r} \\
\vdots & \vdots & \ddots & \vdots \\
A_{N1} I_{r\times r} & A_{N2} I_{r\times r} & \dots & A_{NM} I_{r\times r}
\end{bmatrix}
\end{equation*}

Python implementation of these are one-liners that just call the existing numpy Kronecker product:

def spread( M, n ):
"""Make n copies of M along the matrix diagonal"""
return np.kron( np.eye(n), M )
def split( M, n ):
"""Replace each entry of M with a n-by-n identity matrix scaled by the entry value"""
return np.kron( M, np.eye(n) )

With these, it is possible to define matrix products. There are also some fun properties that come from the Kronecker product:

\begin{align*}
\spread{M}{n}^T &=& \spread{M^T}{n} \\
\spread{M}{n}^{-1} &=& \spread{M^{-1}}{n} \\
\split{M}{n}^T &=& \split{M^T}{n} \\
\split{M}{n}^{-1} &=& \split{M^{-1}}{n}
\end{align*}

## Matrix Products

Let \(X\) be a \(N\times K\) matrix and \(Y\) be a \(K\times M\) matrix, then their product \(W = XY\) will be a \(N\times M\) matrix and the following are true:

\begin{align*}
\rvec(W) &=& \split{X}{M} \rvec(Y) \\
\cvec(W) &=& \spread{X}{M} \cvec(Y) \\
\rvec(W) &=& \spread{Y^T}{N} \rvec(X) \\
\cvec(W) &=& \split{Y^T}{N} \cvec(X)
\end{align*}

What these let you do is unbury any (single) term that's in the middle of some horrendous expression. For example, say there's an expression \(W = AXB\) with \(A \in \mathcal{R}^{N\times R}\), \(X \in \mathcal{R}^{R\times S}\) and \(B \in \mathcal{R}^{S \times M}\) and you want to isolate \(X\) as a column vector. Then you have:

\begin{align*}
\cvec(W) &=& \spread{A}{M} \split{B^T}{R} \cvec(X) \\
&=& \left( I_{M\times M} \otimes A \right) \left( B^T \otimes I_{R\times R} \right) \cvec(X) \\
&=& \left( I_{M\times M} B^T \right) \otimes \left( A I_{R\times R} \right) \cvec(X) \\
&=& \left( B^T \otimes A \right) \cvec(X)
\end{align*}

The step in the third line is the Kronecker product identity:

\begin{equation*}
\left( A \otimes B \right) \left( C \otimes D \right ) = \left( A C \right) \otimes \left( B D \right)
\end{equation*}

The same steps can also be done if you want row-major storage of \(X\):

\begin{align*}
\rvec(W) &=& \split{A}{M} \spread{B^T}{R} \rvec(X) \\
&=& \left( A \otimes I_{M\times M} \right) \left( I_{R\times R} \otimes B^T \right) \rvec(X) \\
&=& \left( A \otimes B^T \right) \rvec(X)
\end{align*}

In these cases the dimensions work out correctly but you don't need to worry too much about them since the dimensions of \(X\) are constrained by the column could in \(A\) and the row count in \(B\). This only applies when operating with them symbolically of course.

## Frobenius Norm Problems

With this information, it becomes pretty easy to deal with Frobenius norm problems like:

\begin{equation*}
X = \mbox{argmin} \frac{1}{2} \| A X B - Y \|_F^2
\end{equation*}

I have found these frustrating in the past since I invariably end up in indexing hell. But it becomes much easier with the identities above. Applying them first of all isolates the X term to the right-hand-side of the expression and also converts the Frobenius norm to an \(L_2\) norm. Making the substitutions \(\tilde{x} = \cvec(X)\) and \(\tilde{y} = \cvec(y)\),

\begin{align*}
\tilde{x}_* &=& \mbox{argmin} \frac{1}{2} \| \left( B^T \otimes A \right) \tilde{x} - \tilde{y} \|_2^2 \\
&=& \mbox{argmin} \frac{1}{2} \tilde{x}^T \left( B \otimes A^T \right) \left( B^T \otimes A \right) \tilde{x} - \tilde{x}^T \left( B \otimes A^T \right) \tilde{y} + \mbox{constants} \\
&=& \mbox{argmin} \frac{1}{2} \tilde{x}^T \left( B B^T \otimes A^T A \right) \tilde{x} - \tilde{x}^T \left( B \otimes A^T \right) \tilde{y}
\end{align*}

This now looks like a standard least-squares problem, albeit one with this Kronecker product mashed into it. Setting the gradient to zero gives the solution as:

\begin{align*}
\tilde{x}_* &=& \left( B B^T \otimes A^T A \right)^{-1} \left( B \otimes A^T \right ) \tilde{y} \\
&=& \left( (B B^T)^{-1} \otimes (A^T A)^{-1} \right) \left( B \otimes A^T \right) \tilde{y} \\
&=& \left( (B B^T)^{-1} B \otimes (A^T A)^{-1} A^T \right) \tilde{y}
\end{align*}

This result has a similar form to \(AXB = \left( B^T \otimes A\right)\cvec(X)\), which can be used to get back to the matrix form of the expression:

\begin{align*}
X &=& (A^T A)^{-1} A^T Y \left( (B B^T)^{-1} B\right)^T \\
&=& (A^T A)^{-1} A^T Y B^T (B B^T)^{-1}
\end{align*}

It's also possible to come at this from a different direction

\begin{align*}
X &=& \mbox{argmin} \frac{1}{2} \| A X B - Y \|_F^2 \\
&=& \mbox{argmin} \frac{1}{2} \mbox{Tr}\left( (AXB - Y)(AXB - Y)^T \right)
\end{align*}

The gradient of this is given in the Matrix Cookbook.

\begin{equation*}
\frac{\partial}{\partial X} \frac{1}{2} \mbox{Tr}\left( (AXB - Y)(AXB - Y)^T \right) = A^T \left( A X B - Y \right) B^T
\end{equation*}

Setting the gradient of this to zero and isolating \(X\) gives the following:

\begin{align*}
A^T A X B B^T - A^T Y B^T = 0 \\
X = (A^T A)^{-1} A^T Y B^T (B B^T)^{-1}
\end{align*}

which matches the previous result.

#### Nov 10, 2018

\begin{equation*}
\newcommand{pd}[2]{\frac{\partial #1}{\partial #2}}
\end{equation*}

Let \(x = \left[ x_1, x_2, x_3 \right]^T\) and \(y_1 = f(x)\) be a scalar function. Then \(\nabla y_1\) is:

\begin{equation*}
\nabla y_1 = \left[ \pd{f}{x_1}, \pd{f}{x_2}, \pd{f}{x_3} \right]
\end{equation*}

i.e., a row-vector. This allows it to be compatible with the corresponding definition of the Jacobian when, i.e, when

\begin{equation*}
y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} f_1(x) \\ f_2(x) \end{bmatrix}
\end{equation*}

then the Jacobian of \(y\) is:

\begin{equation*}
J_y = \begin{bmatrix}
\pd{f_1}{x_1} & \pd{f_1}{x_2} & \pd{f_1}{x_3} \\
\pd{f_2}{x_1} & \pd{f_2}{x_2} & \pd{f_2}{x_3}
\end{bmatrix} = \begin{bmatrix}
\nabla f_1 \\
\nabla f_2
\end{bmatrix}
\end{equation*}

This also allows linearization of the scalar or vector function to be consistent. Here \(\delta x\)

\begin{align*}
y_1(x+\delta x) \approx y_1(x) + \nabla y_1 \delta x \\
y (x+\delta x) \approx y(x) + J_y \delta x
\end{align*}

That's great, but where it is useful is in differentiating more complex expressions. For example in the context of optimization or regression \(E = \frac{1}{2} \| y \|^2\). \(E\) is a scalar function of a vector function of a vector and it's a pain to differentiate this symbolically with respect to the parameters \(x\).

What helps here is this:

\begin{equation*}
\pd{E}{x} = \pd{E}{y} \pd{y}{x} = y^T J_y
\end{equation*}

The transpose on the \(y\) factor is included because it it fits the definition that the gradient of a scalar function taking a vector argument is a row-vector. The above expression expands to:

\begin{equation*}
\pd{E}{x} = \left[ y_1, y_2 \right] \begin{bmatrix}
\pd{f_1}{x_1} & \pd{f_1}{x_2} & \pd{f_1}{x_3} \\
\pd{f_2}{x_1} & \pd{f_2}{x_2} & \pd{f_2}{x_3}
\end{bmatrix} = \begin{bmatrix} y_1 \nabla y_1 + y_2 \nabla y_2 \end{bmatrix}
\end{equation*}

which shows that the terms involving \(y_1\) and \(y_2\) are correctly decoupled and only combine through the summation implied by the 2-norm.

What about when \(y = A x - b\)? Well, it's pretty easy to substitute in:

\begin{align*}
\pd{E}{y} = y^T \\
\pd{y}{x} = A \\
\pd{E}{x} = (A x - b)^T A
\end{align*}

Wait! What's up with that? Isn't it supposed to be \(A^T(A x -b)\)? Well, yes, but that's when gradients are column vectors. Which they're not here. And if you transpose the row vector result you get the expected column vector result.

\begin{equation*}
\pd{E}{x}^T = \left( (A x - b)^T A \right)^T = A^T (A x -b)
\end{equation*}

## Playing devil's advocate

So what happens if you decide to make gradients column vectors? Well, for starters, you have to redefine the del operator from \(\nabla = \left[ \pd{}{x_1}, \pd{}{x_2}, \dots, \pd{}{x_N} \right]\) to:

\begin{equation*}
\nabla = \begin{bmatrix}
\pd{}{x_1} \\
\pd{}{x_2} \\
\vdots \\
\pd{}{x_3}
\end{bmatrix}
\end{equation*}

Redefining fundamental mathematical things should be the first clue this is wrong but pressing on using the same definitions, you get:

\begin{equation*}
\nabla y_1 = \begin{bmatrix} \pd{y_1}{x_1} \\ \pd{y_1}{x_2} \\ \pd{y_1}{x_3} \end{bmatrix}
\end{equation*}

Stacking these gradients column-wise gives something resembling the Jacobian but which is actually its transpose:

\begin{equation*}
\pd{y}{x} = J_y^T = \begin{bmatrix}
\pd{y_1}{x_1} & \pd{y_2}{x_1} \\
\pd{y_1}{x_2} & \pd{y_2}{x_2} \\
\pd{y_1}{x_3} & \pd{y_2}{x_3}
\end{bmatrix}
\end{equation*}

The chain rule then gives:

\begin{equation*}
\pd{E}{x} = \pd{E}{y} \pd{y}{x}
\end{equation*}

which leads to total garbage since the array dimensions are not compatible for matrix multiplication anymore.

\begin{equation*}
\pd{E}{x} \neq y J^T = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \begin{bmatrix}
\pd{y_1}{x_1} & \pd{y_2}{x_1} \\
\pd{y_1}{x_2} & \pd{y_2}{x_2} \\
\pd{y_1}{x_3} & \pd{y_2}{x_3}
\end{bmatrix} \hspace{0.25cm}{\bf\mbox{Incompatible!}}
\end{equation*}

But this can be fixed by changing the order of the chain rule from \(\pd{E}{x} = \pd{E}{y} \pd{y}{x}\) to \(\pd{E}{x} = \pd{y}{x} \pd{E}{y}\) to get:

\begin{equation*}
\pd{E}{x} = J^T y
\end{equation*}

which for the example of \(y = A x - b\) gives the expected column oriented gradient of \(\nabla E = A^T (A x - b)\). The difference here is that matrix composition **left multiplies** new transformations onto an existing one while the chain rule is typically written with **right multiplication**. This is really the only difference between the column-oriented and row-oriented gradients, other than the discrepancy between the definitions of the gradient and Jacobian matrix. When the Jacobian and gradient are defined correctly (i.e. **row-oriented**) this does not occur.

What this ends up meaning is that in order for the results to work out correctly with column-oriented gradients you have to carry the Jacobian transpose throughout your work **and** remember to re-order the chain rule, otherwise you need to forensically reconstruct what the dimensions should be from the scalar terms of your objective function (in optimization anway). I feel this is too high a price to pay.

On the other hand, defining points in one domain as column-vectors and gradients as row-vectors means that expressions involving both need some added boilerplate, e.g.:

\begin{equation*}
v = x - \delta \nabla y \hspace{0.25cm}{\bf\mbox{Incompatible!}}
\end{equation*}

with \(\delta\) a scalar no longer have compatible dimensions. Instead the equation above would need to be expressed as:

\begin{equation*}
v = x - \delta (\nabla y)^T
\end{equation*}

This is awkward but tends not to be a big problem since many programming langages have good array broadcasting that allows the transpose to be omitted when programming.

## Some Random Example

Let:

\begin{equation*}
A = \begin{bmatrix}
a_{11} & a_{12} \\
a_{21} & a_{22} \\
a_{31} & a_{32}
\end{bmatrix}
\end{equation*}

and:

\begin{equation*}
B = \begin{bmatrix}
b_{11} & b_{12} & b_{13} & b_{14} \\
b_{21} & b_{22} & b_{23} & b_{24}
\end{bmatrix}
\end{equation*}

and let \(A\) & \(B\) be functions of \(\theta \in \mathcal{R}^5\). Now try to minimize the Frobenius norm of \(Y = AB\):

\begin{equation*}
F(\theta) = \frac{1}{2} \| Y \|_F^2 = \mbox{Tr}\left(Y Y^T\right)
\end{equation*}

\(F(\theta)\) is a scalar and \(\theta\) is a vector so the gradient should be a row-vector of size \(1\times5\) in the end. Starting with the chain rule an issue is evident:

\begin{equation*}
\nabla F(\theta) = Y \pd{Y}{\theta}
\end{equation*}

\(Y\) is the wrong shape \(3\times4\) and \(\pd{Y}{\theta}\) is a \(3 \times 4 \times 5\) array where the final index (k) holds \(\pd{Y}{\theta_k}\). In order to make this work, we can concatenate the rows \(y_{i*}\) of \(Y\) to get a \(1\times 12\) matrix \(\tilde{Y}\) i.e.

\begin{equation*}
\tilde{Y} = \begin{bmatrix}
y_{1*} & y_{2*} & y_{3*}
\end{bmatrix} = \mbox{vec}(Y^T)^T
\end{equation*}

where \(\mbox{vec}(P)\) stacks the columns of \(P\) into a column vector. We can likewise stack the columns \(y_{*j}\) of \(Y\) to get a \(12x1\) column vector \(\hat{Y}=\mbox{vec}(Y)\). The derivative of this with respect to \(\theta\) is just the Jacobian of each column of \(Y\), stacked row-wise:

\begin{equation*}
\pd{\hat{Y}}{\theta} = \begin{bmatrix}
\pd{y_{*1}}{\theta} \\
\pd{y_{*2}}{\theta} \\
\pd{y_{*3}}{\theta} \\
\pd{y_{*4}}{\theta}
\end{bmatrix}
\end{equation*}

By doing this, the sum over all matrix entries implied by the Frobenius norm is absorbed as an inner product of \(\tilde{Y} \in \mathcal{R}^{1\times12}\) and \(\pd{\hat{Y}}{\theta} \in \mathcal{R}^{12\times5}\). This gives the correct dimensions for the gradient of \(1\times 5\). Note that everything works pretty cleanly by adopting the row-vector form for gradients.

We still need to compute the \(\pd{Y_{*i}}{\theta}\) terms. Starting from the definition of \(Y = AB\) it's pretty easy to see that \(y_{*i} = A b_{*i}\). We want to get \(\pd{y_{*i}}{\theta} \in \mathcal{R}^{3\times5}\) and so go to the chain rule:

\begin{equation*}
\pd{y_{*i}}{\theta} = A \pd{b_{*i}}{\theta} + \pd{A}{\theta} b_{*i}
\end{equation*}

The first term is no problem, it is simply a transform (\(A\)) applied to the Jacobian of \(b_{*i}\). The shapes work out correctly. The second term is more problematic, \(\pd{A}{\theta}\) is a \(3\times 2 \times 5\) array while \(b_{*i}\) is \(2 \times 1\). One way to address this is to stack rows of \(A\) into a column vector using the vectorization operator and splay \(b_{*i}\) using the Kronecker Product:

\begin{equation*}
A b_{*i} = \begin{bmatrix}
b_{*i}^T & {\bf 0} & {\bf 0} \\
{\bf 0} & b_{*i}^T & {\bf 0} \\
{\bf 0} & {\bf 0} & b_{*i}^T
\end{bmatrix} \begin{bmatrix}
a_{1*}^T \\
a_{2*}^T \\
a_{3*}^T
\end{bmatrix} = \left( {\bf I_{3\times3}} \otimes b_{*i}^T \right)
\mbox{vec}(A^T)
\end{equation*}

#### Nov 02, 2018

This page describes a basic approach to extracting Euler angles from rotation matrices. This follows what OpenCV does, (which is where I got the process) but I also wanted to understand **why** this approach was used.

Here I'm considering rotation performed first around x-axis (\(\alpha\)), then y-axis (\(\beta\)) and finally z-axis (\(\gamma\)). For this, the rotation matrix is shown below. Other rotation orders are analogous.

\begin{equation*}
R(\alpha,\beta,\gamma) = \left[\begin{matrix}\cos{\left (\beta \right )} \cos{\left (\gamma \right )} & \sin{\left (\alpha \right )} \sin{\left (\beta \right )} \cos{\left (\gamma \right )} - \sin{\left (\gamma \right )} \cos{\left (\alpha \right )} & \sin{\left (\alpha \right )} \sin{\left (\gamma \right )} + \sin{\left (\beta \right )} \cos{\left (\alpha \right )} \cos{\left (\gamma \right )} & 0\\\sin{\left (\gamma \right )} \cos{\left (\beta \right )} & \sin{\left (\alpha\right )} \sin{\left (\beta \right )} \sin{\left (\gamma \right )} + \cos{\left (\alpha \right )} \cos{\left (\gamma \right )} & - \sin{\left (\alpha \right )} \cos{\left (\gamma \right )} + \sin{\left (\beta \right )} \sin{\left (\gamma \right )} \cos{\left (\alpha \right )} & 0\\- \sin{\left (\beta \right )} & \sin{\left (\alpha \right )} \cos{\left (\beta \right )} & \cos{\left (\alpha \right )} \cos{\left (\beta \right )} & 0\\0 & 0 & 0 & 1\end{matrix}\right]
\end{equation*}

To solve for the Euler angles, we can use a little bit of trigonometry. The best way, in my opinion, is to find pairs of entries that are each products of two factors. Using the identity \(\sin^2 + \cos^2 = 1\) can be used to isolate specific angles, e.g. \(R_{0,0}\) and \(R_{0,1}\) can be used to find \(\cos(\beta)\) which unlocks the rest of the angles:

\begin{align*}
(-1)^k \cos(\beta + k \pi) = \sqrt{ \cos^2(\beta) ( \sin^2(\gamma) + \cos^2(\gamma)) } = \sqrt{ R_{0,0}^2 + R_{1,0}^2 } \\
\cos(\beta) \approx \sqrt{ R_{0,0}^2 + R_{1,0}^2 }
\end{align*}

**Pay attention to the first line:** For any true value of \(\beta\), this formula will return the same value for \(\beta + k\pi\) where \(k\) is an arbitrary integer. This will in turn determine the angles \(\alpha\) and \(\gamma\) in order to be compatible. The approximate equality will return \(\beta \in [-\pi/2, \pi/2]\).

Once \(\cos(\beta)\) is found we can find the angles directly using atan2:

\begin{align*}
\beta = \mbox{atan2}( -R_{2,0}, \cos(\beta) ) \\
\alpha = \mbox{atan2}\left( \frac{R_{2,1}}{\cos(\beta)}, \frac{R_{2,2}}{\cos(\beta)} \right) \\
\gamma = \mbox{atan2}\left( \frac{R_{1,0}}{\cos(\beta)}, \frac{R_{0,0}}{\cos(\beta)} \right)
\end{align*}

**The other issue** occurs when \(|\cos(\beta)| = 0\), which causes division by zero in the equations for \(\alpha,\gamma\) (\(\beta\) is still well defined). In this case there are a number of options. What OpenCV does is to arbitrarily decide that \(\gamma = 0\), which means that \(\sin(\gamma) = 0\) and \(\cos(\gamma) = 1\). The formulas in this case are:

\begin{align*}
\beta = \mbox{atan2}( -R_{2,0}, \cos(\beta) ) \\
\alpha = \mbox{atan2}( -R_{1,2}, R_{1,1} ) \\
\gamma = 0
\end{align*}

There are, of course, other options. Python code implementing these operations is below, along with unit testing for the degenerate and non-degerate cases.

import unittest
import numpy
def x_rotation( theta ):
"""Generate a 4x4 homogeneous rotation matrix about x-axis"""
c = numpy.cos(theta)
s = numpy.sin(theta)
return numpy.array([
[ 1, 0, 0, 0],
[ 0, c,-s, 0],
[ 0, s, c, 0],
[ 0, 0, 0, 1]
])
def y_rotation( theta ):
"""Generate a 4x4 homogeneous rotation matrix about y-axis"""
c = numpy.cos(theta)
s = numpy.sin(theta)
return numpy.array([
[ c, 0, s, 0],
[ 0, 1, 0, 0],
[-s, 0, c, 0],
[ 0, 0, 0, 1]
])
def z_rotation( theta ):
"""Generate a 4x4 homogeneous rotation matrix about z-axis"""
c = numpy.cos(theta)
s = numpy.sin(theta)
return numpy.array([
[ c,-s, 0, 0],
[ s, c, 0, 0],
[ 0, 0, 1, 0],
[ 0, 0, 0, 1]
])
def xyz_rotation( angles ):
"""Generate 4x4 homogeneous rotation in x then y then z order"""
return numpy.dot( z_rotation(angles[2]), numpy.dot( y_rotation(angles[1]), x_rotation(angles[0]) ) )
def xyz_rotation_angles( R, eps=1e-8, check=False ):
"""Back out the Euler angles that would lead to the given matrix"""
if check and numpy.linalg.norm( numpy.dot( R.transpose(), R ) - numpy.eye(4) ) > eps:
raise ValueError('Input matrix is not a pure rotation, R\'R != I')
cos_beta = numpy.sqrt( R[0,0]**2 + R[1,0]**2 )
if numpy.abs(cos_beta) > eps:
alpha = numpy.arctan2( R[2,1]/cos_beta, R[2,2]/cos_beta )
beta = numpy.arctan2(-R[2,0], cos_beta )
gamma = numpy.arctan2( R[1,0]/cos_beta, R[0,0]/cos_beta )
return numpy.array((alpha,beta,gamma))
else:
alpha = numpy.arctan2(-R[1,2], R[1,1] )
beta = numpy.arctan2(-R[2,0], cos_beta )
gamma = 0
return numpy.array((alpha,beta,gamma))
class test_euler_angles(unittest.TestCase):
def test_angles(self):
"""Do fuzz testing on arbitrary rotations, limiting beta to valid range"""
for i in range(1000):
alpha = (numpy.random.rand()-0.5)*numpy.pi*2
beta = (numpy.random.rand()-0.5)*numpy.pi*0.9999
gamma = (numpy.random.rand()-0.5)*numpy.pi*2
ang = xyz_rotation_angles( xyz_rotation([alpha,beta,gamma]))
self.assertAlmostEqual( alpha, ang[0] )
self.assertAlmostEqual( beta, ang[1] )
self.assertAlmostEqual( gamma, ang[2] )
def test_degeneracies(self):
"""Do fuzz testing on the degenerate condition of beta = +- pi/2"""
for i in range(1000):
alpha = (numpy.random.rand()-0.5)*numpy.pi*2
beta = numpy.sign(numpy.random.randn())*numpy.pi/2
gamma = (numpy.random.rand()-0.5)*numpy.pi*2
R = xyz_rotation((alpha,beta,gamma))
ang = xyz_rotation_angles( R )
R2 = xyz_rotation( ang )
# check that beta is recovered, gamma is set to zero and
# the rotation matrices match to high precision
self.assertAlmostEqual( beta, ang[1] )
self.assertAlmostEqual( ang[2], 0 )
for j in range(16):
self.assertAlmostEqual( R.ravel()[j], R2.ravel()[j] )
if __name__ == '__main__':
unittest.main()

#### Nov 02, 2018

Starting with the \(4\times4\) homogeneous transform matrix with parameters \(\beta = [ \theta_x, \theta_y, \theta_z, t_x, t_y, t_z ]^T\) where rotations are performed in XYZ order and using the following substitutions:

\begin{align*}
c_x = \cos(\theta_x) \\
s_x = \sin(\theta_x) \\
c_y = \cos(\theta_y) \\
s_y = \sin(\theta_y) \\
c_z = \cos(\theta_z) \\
s_z = \sin(\theta_z) \\
\end{align*}

the vector function mapping a point \(p = [p_x, p_y, p_z, 1]^T\) in the body coordinate system to a point in the world coordinate system \(w = [w_x, w_y, w_z, 1]^T\) is:

\begin{equation*}
\begin{bmatrix} w_x \\ w_y \\ w_z \\ 1 \end{bmatrix} = F( p, \beta ) =
\left[\begin{matrix}c_{y} c_{z} & - c_{x} s_{z} + c_{z} s_{x} s_{y} & c_{x} c_{z} s_{y} + s_{x} s_{z} & t_{x}\\c_{y} s_{z} & c_{x} c_{z} + s_{x} s_{y} s_{z} & c_{x} s_{y} s_{z} - c_{z} s_{x} & t_{y}\\- s_{y} & c_{y} s_{x} & c_{x} c_{y} & t_{z}\\0 & 0 & 0 & 1\end{matrix}\right]\begin{bmatrix} p_x \\ p_y \\ p_z \\ 1 \end{bmatrix}
\end{equation*}

and the Jacobian with respect to the parameters \(\beta\) is:

\begin{equation*}
J_F = \left[\begin{matrix}p_{y} \left(c_{x} c_{z} s_{y} + s_{x} s_{z}\right) + p_{z} \left(c_{x} s_{z} - c_{z} s_{x} s_{y}\right) & c_{x} c_{y} c_{z} p_{z} + c_{y} c_{z} p_{y} s_{x} - c_{z}p_{x} s_{y} & - c_{y} p_{x} s_{z} + p_{y} \left(- c_{x} c_{z} - s_{x} s_{y} s_{z}\right) + p_{z} \left(- c_{x} s_{y} s_{z} + c_{z} s_{x}\right) & 1 & 0 & 0\\p_{y} \left(c_{x} s_{y} s_{z} - c_{z} s_{x}\right) + p_{z} \left(- c_{x} c_{z} - s_{x} s_{y} s_{z}\right) & c_{x} c_{y} p_{z} s_{z} + c_{y} p_{y} s_{x} s_{z} - p_{x} s_{y} s_{z} & c_{y} c_{z} p_{x} + p_{y} \left(- c_{x} s_{z} + c_{z} s_{x} s_{y}\right) + p_{z} \left(c_{x} c_{z} s_{y} + s_{x} s_{z}\right) & 0 & 1 & 0\\c_{x} c_{y} p_{y} - c_{y} p_{z} s_{x} & - c_{x} p_{z} s_{y} - c_{y} p_{x} -p_{y} s_{x} s_{y} & 0 & 0 & 0 & 1\\0 & 0 & 0 & 0 & 0 & 0\end{matrix}\right]
\end{equation*}

Python code for these respective operations is below:

def make_transform( beta ):
c_x = numpy.cos(beta[0])
s_x = numpy.sin(beta[0])
c_y = numpy.cos(beta[1])
s_y = numpy.sin(beta[1])
c_z = numpy.cos(beta[2])
s_z = numpy.sin(beta[2])
t_x = beta[3]
t_y = beta[4]
t_z = beta[5]
return numpy.array([
[c_y*c_z, -c_x*s_z + c_z*s_x*s_y, c_x*c_z*s_y + s_x*s_z, t_x],
[c_y*s_z, c_x*c_z + s_x*s_y*s_z, c_x*s_y*s_z - c_z*s_x, t_y],
[-s_y, c_y*s_x, c_x*c_y, t_z],
[0, 0, 0, 1]
])
def make_transform_jacobian( beta, p ):
c_x = numpy.cos(beta[0])
s_x = numpy.sin(beta[0])
c_y = numpy.cos(beta[1])
s_y = numpy.sin(beta[1])
c_z = numpy.cos(beta[2])
s_z = numpy.sin(beta[2])
t_x = beta[3]
t_y = beta[4]
t_z = beta[5]
p_x = p[0]
p_y = p[1]
p_z = p[2]
return numpy.array([
[p_y*(c_x*c_z*s_y + s_x*s_z) + p_z*(c_x*s_z - c_z*s_x*s_y), c_x*c_y*c_z*p_z + c_y*c_z*p_y*s_x - c_z*p_x*s_y, -c_y*p_x*s_z + p_y*(-c_x*c_z - s_x*s_y*s_z) + p_z*(-c_x*s_y*s_z + c_z*s_x), 1, 0, 0],
[p_y*(c_x*s_y*s_z - c_z*s_x) + p_z*(-c_x*c_z - s_x*s_y*s_z), c_x*c_y*p_z*s_z + c_y*p_y*s_x*s_z - p_x*s_y*s_z, c_y*c_z*p_x + p_y*(-c_x*s_z + c_z*s_x*s_y) + p_z*(c_x*c_z*s_y + s_x*s_z), 0, 1, 0],
[c_x*c_y*p_y - c_y*p_z*s_x, -c_x*p_z*s_y - c_y*p_x - p_y*s_x*s_y, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0]
])

I generated these using sympy to build the transformations and used a finite-difference Jacobian function to check the output.

#### Sep 15, 2018

In the previous part of this post I introduced deconvolution and built up the core of a total-variation deconvolution algorithm based on ADMM.

def deconvolve_ADMM( K, B, num_iters=40, rho=1.0, gamma=0.1 ):
# store the image size
dim = B.shape
# pad out the kernel if its the wrong shape
if K.shape != dim:
raise ValueError('B and K must have same shape')
# define the two derivative operators
Dx = Kernel2D( [ 0, 0], [-1, 0], [-1.0, 1.0] )
Dy = Kernel2D( [-1, 0], [ 0, 0], [-1.0, 1.0] )
# define an initial solution estimate
I = numpy.zeros( dim )
# define the two splitting variables and lagrangr multipliers
Zx = Dx.mul( I )
Zy = Dy.mul( I )
Ux = numpy.zeros( Zx.shape )
Uy = numpy.zeros( Zy.shape )
# cache the necessary terms for the I update, need to circularly
# shift the kernel so it's DC spot lies at the corner
fK = numpy.fft.fft2( numpy.roll( K/numpy.sum(K), (dim[0]//2,dim[1]//2), axis=(0,1) ) )
fB = numpy.fft.fft2( B )
fDx = Dx.spectrum( dim )
fDy = Dy.spectrum( dim )
# build the numerator and denominator
num_init = numpy.conj( fK )*fB
den = numpy.conj( fK )*fK + rho*( numpy.conj(fDx)*fDx + numpy.conj(fDy)*fDy )
# define the L1 soft-thresholding operator
soft_threshold = lambda q: numpy.sign(q)*numpy.maximum( numpy.abs( q ) - gamma/rho, 0.0 )
# main solver loop
for iter in range( num_iters ):
print('ADMM iteration [%d/%d]'%(iter,num_iters))
# I-update
V = rho*( Dx.mul( Zx - Ux, trans=True) + Dy.mul( Zy - Uy, trans=True ) )
I = numpy.real( numpy.fft.ifft2( (num_init + numpy.fft.fft2(V))/den ) )
# Z-updates, cache the gradient filter results
tmp_x = Dx.mul( I )
tmp_y = Dy.mul( I )
Zx = soft_threshold( tmp_x + Ux )
Zy = soft_threshold( tmp_y + Uy )
# multiplier update
Ux += tmp_x - Zx
Uy += tmp_y - Zy
# return reconstructed result
return I

That is it.

#### Sep 14, 2018

Deconvolution is the process of removing blur from a signal. For images, there are two main types:

**Blind deconvolution** removes blur without specific knowledge of what the exact blur is. It generally requires estimating the blur kernel.
**Non-blind deconvolution** removes blur but is provided the blur kernel directly. It is an easier problem than blind deconvolution. Non-blind deconvolution is often used as a component of blind deconvolution algorithms.

Here I am focusing on non-blind deconvolution.

## Optimizing for \({\bf I}\)

Making the typical assumption of \(\epsilon\) being Gaussian noise, a solution for the sharp image \({\bf I}\) can be found by looking for an underlying sharp image that best explains, in a least-squares sense, the measured data \({\bf B}\).

\begin{equation*}
{\bf I} = \mbox{argmin}_{\bf I} \frac{1}{2} \| {\bf K} \otimes {\bf I} - {\bf B} \|_2^2
\end{equation*}

Denoting \(\mathcal{F}(\bf K)\) and \(\mathcal{F}^{-1}(\bf K)\) as the forward and inverse Fourier transform of \({\bf K}\) and \(*\) as complex conjugation, this equation can be solved very quickly in the Fourier domain:

\begin{equation*}
{\bf I} = \mathcal{F}^{-1}\left[ \frac{\mathcal{F}({\bf K})^*\mathcal{F}(\bf B)}{\mathcal{F}({\bf K})^2} \right]
\end{equation*}

Unfortunately the results are garbage. Zero or very low magnitude values of the Fourier transform of \({\bf K}\) cause divide-by-zero issues and amplify noise. These tiny values are due to convolution by \({\bf K}\) attenuating certain frequency bands to nothing. It makes sense intuitively since blurry images lose the sharp edges and fine details of focused images so the Fourier modes representing these are lost forever. Trying to undo the process is poorly defined and so the solve above just finds *anything* that minimizes the objective, whether it *looks* like an image or not.

It's possible to stabilize the above a bit by adding fudge factors here and there but the results are not very compelling. A better approach is to add regularizers to the optimization that favor specific types of underlying sharp images. One very effective set is the total variation which is incorporated below:

\begin{equation*}
{\bf I} = \mbox{argmin}_{\bf I} \frac{1}{2} \| {\bf K} \otimes {\bf I} - {\bf B} \|_2^2 + \gamma \| \frac{\partial}{\partial x} {\bf I} \|_1 + \gamma \| \frac{\partial}{\partial y} {\bf I} \|_1
\end{equation*}

The two extra terms penalize the gradients of the image and reach a zero value only for images that are constant. Using a 1-norm makes them less sensitive to large jumps in value than a 2-norm would be which makes these priors favor piecewise constant reconstructions. This is qualitatively similar to natural images that feature piecewise smooth features.

The only problem is that these 1-norm terms make the optimization much more difficult and costly to solve. However, the effort is worth it.

## Formulating the Optimization with ADMM

The problem with solving the optimization above is that the 1-norm terms are non-smooth. This is challenging for optimization methods. Fortunately, there are some very effective tools for solving 1-norm problems in isolation and a framework call the *Alternating Direction Method of Multipliers* can be used to transform intractible monolithic objective functions (such as the one above) into several coupled simpler problems.

By being careful about which terms go where, this can result in very efficient solves for problems involving complex priors. What follows is only one option for the above problem.

The overall goal is to split the \(L_1\) solves into separate pixel-wise solves. This is done by introducing new variables and constraints that make the problem look more complex but actually help. The data term will stay as a coupled \(L_2\) problem and will absorb the coupling terms as well (which are also \(L_2\)) and serve to stabilize the solve.

The first step is to isolate the \(L_1\) terms. The differential operators in the priors have the effect of coupling all pixels to each other, but the fast solvers for \(L_1\) problems expect terms that look like \(\|{\bf A}\|_1\) for some image \({\bf A}\). This can be accomplished by introducing splitting variables \({\bf Z}_x,{\bf Z}_y\) with appropriate constraints.

\begin{align*}
{\bf I} = \mbox{argmin} & & \frac{1}{2} \| {\bf K} \otimes {\bf I} - {\bf B} \|_2^2 + \gamma \| {\bf Z}_x \|_1 + \gamma \| {\bf Z}_y \|_1 \\
\mbox{subject to} & & \frac{\partial}{\partial x} {\bf I} - {\bf Z}_x = 0 \\
& & \frac{\partial}{\partial y} {\bf I} - {\bf Z}_y = 0
\end{align*}

It is also important to make sure that the \({\bf Z}_x, {\bf Z}_y\) terms appear without complex linear operators in the constraints. This ensures that the terms will diagonalize to a large set of pixelwise 1D optimizations rather than couple together.

With the splitting variables introduced, the constraints can be incorporated into the objective via penalty terms with (scaled) Lagrange multipliers \({\bf U}_x,{\bf U}_y\). The multipliers keep the constraints stiff. Without the multipliers, the high weights needed on the penalty terms to enforce the constraints can dominate the solves.

Applying this step transform back from a constrained problem to a (much more complicated) unconstrained problem again:

\begin{equation*}
{\bf I} = \mbox{argmin} \frac{1}{2} \| {\bf K} \otimes {\bf I} - {\bf B} \|_2^2 + \frac{\rho}{2} \| \frac{\partial}{\partial x} {\bf I} - {\bf Z}_x + {\bf U}_x \|_2^2 + \frac{\rho}{2} \| \frac{\partial}{\partial y} {\bf I} - {\bf Z}_y + {\bf U}_y \|_2^2 + \gamma \| {\bf Z}_x \|_1 + \gamma \| {\bf Z}_y \|_1
\end{equation*}

This looks like a mess, but is just the original data term, the two added \(L_1\) terms and two additional \(L_2\) coupling terms. Calling the whole thing \(F({\bf I},{\bf Z}_x, {\bf Z}_y, {\bf U}_x, {\bf U}_y)\) the ADMM algorithm solves this by the following steps:

- Updates \({\bf I}\) by solving \({\bf I} = \mbox{argmin}_{\bf I} F({\bf I},{\bf Z}_x, {\bf Z}_y, {\bf U}_x, {\bf U}_y)\)
- Updates \({\bf Z}_x\) by solving \({\bf Z}_x = \mbox{argmin}_{{\bf Z}_x} F({\bf I},{\bf Z}_x, {\bf Z}_y, {\bf U}_x, {\bf U}_y)\)
- Updates \({\bf U}_x\) by setting \({\bf U}_x = {\bf U}_x + \frac{\partial}{\partial x} {\bf I} - {\bf Z}_x\)
- Repeats the last two steps, appropriately modified, for \({\bf Z}_y\) and \({\bf U}_y\) respectively.

### The \({\bf I}\) Update

The first step of ADMM is to update \({\bf I}\). This is a straightforward \(L_2\) problem since the only terms that \({\bf I}\) appears is in terms with a 2-norm. Denoting \(\frac{\partial}{\partial x}{\bf I}\) as \({\bf D}_x \otimes {\bf I}\) and likewise for \(\frac{\partial}{\partial y}\), the solve ends up being:

\begin{equation*}
{\bf I} = \mbox{argmin} \frac{1}{2} \| {\bf K} \otimes {\bf I} - {\bf B} \|_2^2 + \frac{\rho}{2} \| {\bf D}_x \otimes {\bf I} - {\bf Z}_x + {\bf U}_x \|_2^2 + \frac{\rho}{2} \| {\bf D}_y \otimes {\bf I} - {\bf Z}_y + {\bf U}_y \|_2^2
\end{equation*}

All the terms are effectively the same: a minimization over the residual of a kernel convolved the the target image with respect to a measured value. The minimum is found by expanding out the term and finding the target image that sets the gradient of the expanded term to zero.

The solution for the first term in the Fourier domain is earlier in the post as:

\begin{equation*}
{\bf I} = \mathcal{F}^{-1}\left[ \frac{\mathcal{F}({\bf K})^*\mathcal{F}(\bf B)}{\mathcal{F}({\bf K})^2} \right]
\end{equation*}

The remaining terms follow the same pattern and sum in the numerator and denominator of the solve:

\begin{equation*}
{\bf I} = \mathcal{F}^{-1}\left[ \frac{
\mathcal{F}({\bf K})^*\mathcal{F}(\bf B)
+\rho \mathcal{F}({\bf D}_x)^*\mathcal{F}({\bf Z}_x - {\bf U}_x)
+\rho \mathcal{F}({\bf D}_y)^*\mathcal{F}({\bf Z}_y - {\bf U}_y)
}{\mathcal{F}({\bf K})^2 + \rho \mathcal{F}({\bf D}_x)^2 + \rho \mathcal{F}({\bf D}_y)^2} \right]
\end{equation*}

This involves 9 Fourier transforms, but only the terms containing \({\bf Z}_x,{\bf U}_x\) actually change during the solve (similarly for the y subscripts). By caching the first term of the numerator, the entire denominator, as well as \(\mathcal{F}({\bf D}_x)\mbox{ & }\mathcal{F}({\bf D}_y)\) this can be reduced to three transforms. It can further be reduced by evaluating the derivative filters \({\bf D}_x\mbox{ & }{\bf D}_y\) in the spatial domain, since they are very inexpensive finite difference filters. The following gets the \({\bf I}\) update down to one forward and one inverse transform once contants are cached:

\begin{align*}
{\bf V} = \rho \left( {\bf D}^T_x \otimes ({\bf Z}_x - {\bf U}_x) + {\bf D}^T_y \otimes ({\bf Z}_y - {\bf U}_y)\right) \\
{\bf I} = \mathcal{F}^{-1}\left[ \frac{
\mathcal{F}({\bf K})^*\mathcal{F}(\bf B) + \mathcal{F}({\bf V})
}{\mathcal{F}({\bf K})^2 + \rho \mathcal{F}({\bf D}_x)^2 + \rho \mathcal{F}({\bf D}_y)^2} \right]
\end{align*}

**Important note:** the \({\bf D}^T_x \otimes\) terms in \({\bf V}\) use the **transposed** filters (mixing notation terribly). This is the complex conjugate of the filter in the Fourier domain version, or matrix transpose if the convolution operation were represented in matrix form. Expressed as spatial domain convolution, the transpose simply requires mirroring the kernels horizontally and vertically about the central pixel.

### The \({\bf Z}\) Updates

The next step of ADMM is to update the splitting variable \({\bf Z}_x\mbox{ & }{\bf Z}_y\). These terms only appear in one coupling terms and one \(L_1\) term each. The updates for the two \({\bf Z}\) and two \({\bf U}\) variables are the same except the specific filter that is used, so I'll only cover one. Here's the update needed for \({\bf Z}_x\):

\begin{equation*}
{\bf Z}_x = \mbox{argmin}_{{\bf Z}_x} \frac{\rho}{2} \| {\bf D}_x \otimes {\bf I} - {\bf Z}_x + {\bf U}_x \|_2^2 + \gamma \| {\bf Z}_x \|_1
\end{equation*}

It's important to note that there are no operators applied to \({\bf Z}_x\), meaning that each pixel can be processed independently. The terms can be rearranged and the weight factors collected onto the \(L_1\) term to give the equivalent minimization:

\begin{equation*}
{\bf Z}_x = \mbox{argmin}_{{\bf Z}_x} \frac{\gamma}{\rho} \| {\bf Z}_x \|_1 + \frac{1}{2} \| {\bf Z}_x - ({\bf D}_x \otimes {\bf I} + {\bf U}_x ) \|_2^2
\end{equation*}

However, a complication is (still) the \(L_1\) term. This is non-smooth and tricky to deal with but the form above allows a non-smooth optimization method known as **proximal operators** to be used. The proximal operator of a scaled function \(\lambda f(q)\) is defined as:

\begin{equation*}
\mbox{prox}_{\lambda f}({\bf q}) = \mbox{argmin}_{\bf p} f({\bf p}) + \frac{1}{2\lambda} \| {\bf p} - {\bf q} \|_2^2
\end{equation*}

The similarities with the \({\bf Z}_x\) minimization are strong. By setting \(f({\bf Z}_x)=\|{\bf Z}_x\|_1\mbox{ & }\lambda = \frac{\gamma}{\rho}\mbox{ & }{\bf q} = {\bf D}_x \otimes {\bf I} + {\bf U}_x\) the proximal operator is recovered. This is great because the closed form solution for \(\mbox{prox}_{\lambda\|{\bf p}\|_1}(\bf{q})\) is known and trivial to implement:

\begin{equation*}
\mbox{prox}_{\lambda\|{\bf p}\|_1}({\bf q}) = \mbox{sign}({\bf q}) \mbox{max}( 0, |{\bf q}| - \lambda)
\end{equation*}

This is known as the *soft-thresholding operator*, or \(L_1\) *shrinkage operator* and is used **everywhere**. It can be implemented in python as:

def soft_threshold( q, lam ):
return numpy.sign(q)*numpy.maximum( numpy.abs(q) - lam, 0.0 )

In my view, it's the most potent line of code that exists in signal processing.

### The \({\bf U}\) Updates

The \({\bf U}\) updates simply add the current constraint violation to the Lagrange multipliers \({\bf U}\). It's kind of like the integral term in a PID controller.

The update was actually given already in the overview above but is repeated here for completeness:

\begin{align*}
{\bf U}_x = {\bf U}_x + {\bf D}_x \otimes {\bf I} - {\bf Z}_x \\
{\bf U}_y = {\bf U}_y + {\bf D}_x \otimes {\bf I} - {\bf Z}_y
\end{align*}

This is identical to the formula ealier, I've just used the \({\bf D}_x \otimes {\bf I}\) notation rather than \(\frac{\partial}{\partial x} {\bf I}\).

## Summary

That's all the math that's needed. It looks like a lot but ends up being surprisingly little code. In the next part I work through the implementation.

#### Sep 09, 2018

In image reconstruction tasks, it's often convenient to perform filtering operations in either the spatial or frequency domain. Both have their advantages and disadvantages. For the spatial domain, small filters such as finite differences can be evaluated extremely quickly and include local coefficients but objective functions often result in large systems of equations. In contrast, many objective functions diagonalize to yield pixel-wise solutions in the frequency domain but spatial variation cannot be incorporated directly.

As a result, it's common to express reconstruction tasks with individual terms in each domain. The domains are then coupled through penalties or constraints that can be expressed naturally either domain and lead to efficient solvers.

By way of example: in the total-variation regularized inpainting example below, the data term expresses a least-squares penalty on a (fixed) random selection of pixels in the reconstruction \({\bf I}\) and measurements \({\bf Y}\) via the selection mask \({\bf S}\) (\(\odot\) is pixelwise multiplication). This term cannot be expressed efficiently in the Fourier domain. However the regularizers (e.g. \(\lambda \| \frac{\partial}{\partial x} {\bf I} \|_1\)) **can and should be** since otherwise they couple every pixel in the image together. This is bad because it results in very costly solves.

\begin{equation*}
{\bf I} = \mbox{argmin}_{\bf I} \frac{1}{2} \| S \odot \left( {\bf I} - {\bf Y} \right) \|_2^2
+ \lambda \| \frac{\partial}{\partial x} {\bf I} \|_1 + + \lambda \| \frac{\partial}{\partial y} {\bf I} \|_1
\end{equation*}

Solving this efficiently requires splitting the problem into separate solve that couple through constraints and Lagrange multipliers (or more simply through penalties). How to do that is an entire topic in itself but the key thing is that **filtering is performed in both domains and must be consistent** to produce sensible results. If this is not the case, all sorts of weird things happen: images shifting each iteration, convergence stalling, etc.

This sounds (and is) easy but requires a bit of attention to detail.

## Brass Tacks

The starting point here is a definition of a sparse kernel. By sparse, I mean a kernel with very few non-zero weights, generally less than 10-20 or so. This includes common filters like Laplacians, edge filters and finite difference stencils.

In the following code, a kernel is defined by three equal length lists containing the row-offsets, column-offsets and filter values for the kernel. In the code, the user specifies the **correlation kernel**; this produces results that match typical finite-difference kernels when the correlate_spatial or correlate_fourier methods are called. Here is the code for the kernel class:

import numpy
class SparseKernel2D:
"""Represents a kernel for a 2D convolution operation
Example:
# make a first-order finite difference along the column axis
K = SparseKernel2D( [0,0], [-1,0], [-1.0, 1.0] )
"""
def __init__( self, offy, offx, vals ):
"""Represents a kernel for a 2D convolution operation
Args:
offy (integer array/list) vertical offsets of non-zero kernel values
offx (integer array/list) horizontal offsets of non-zero kernel values
vals (float array/list) non-zero kernel values
Raises:
ValueError whenever offx, offx, vals do not have the same length
"""
if len(offy) != len(offx) or len(vals) != len(offx):
raise ValueError('offx, offy and vals must be 1D array-like with the same size')
self.offx = offx
self.offy = offy
self.vals = vals
def image( self, dim ):
"""Returns an image represenation of the kernel at the specified dimensions with the kernel centered.
Args:
dim (integer tuple) output image dimensions [height,width]
Raises:
ValueError when dim is not 2-element
Returns:
height x width image of float32 elements
"""
if len(dim) != 2:
raise ValueError('Expected a 2-element integer tuple/list for image dimensions')
cen = ( int(dim[0]/2), int(dim[1]/2) )
out = numpy.zeros( dim, dtype=numpy.float32 )
for dx,dy,val in zip( self.offx, self.offy, self.vals):
out[cen[0]+dy,cen[1]+dx] = val
return out
def spectrum( self, dim ):
"""Returns a spectrum representation of the kernel at the specified dimensions. The returned
kernel is *not* centered, i.e. has it's DC component at the corner.
Args:
dim (integer tuple) output spectrum dimensions [height,width]
Raises:
ValueError when dim is not 2-element
Returns:
height x width complex image
"""
return numpy.fft.fft2( numpy.roll( self.image(dim), [int(dim[0]/2+0.5),int(dim[1]/2+0.5)], axis=[0,1] ) )
def convolve_spatial( self, I ):
"""Convolves the input image by the kernel in the spatial domain
Args:
I (2d float32 array) input image to convolve with kernel
Raises:
ValueError when input is not a 2D ndarray or I.dtype is not numpy.floating subclass
Returns:
Convolution results as 2D ndarray of I.dtype elements
"""
if not isinstance(I,numpy.ndarray) or I.ndim != 2 or not numpy.issubdtype(I.dtype,numpy.floating):
raise ValueError('Expected a 2D ndarray or float32 values')
result = numpy.zeros( I.shape, dtype=I.dtype )
for dx,dy,val in zip(self.offx, self.offy, self.vals):
result += val*numpy.roll( I, [dy,dx], axis=[0,1] )
return result
def correlate_spatial( self, I ):
"""Correlates the input image with the kernel in the spatial domain
"""
if not isinstance(I,numpy.ndarray) or I.ndim != 2 or not numpy.issubdtype(I.dtype,numpy.floating):
raise ValueError('Expected a 2D ndarray or float32 values')
result = numpy.zeros( I.shape, dtype=I.dtype )
for dx,dy,val in zip(self.offx, self.offy, self.vals):
result += val*numpy.roll( I, [-dy,-dx], axis=[0,1] )
return result
def convolve_fourier( self, I ):
"""Convolves the input image by the kernel in the Fourier domain
Args:
I (2d float32 array) input image to convolve with kernel
Raises:
ValueError when input is not a 2D ndarray or I.dtype is not numpy.floating subclass
Returns:
Convolution results as 2D ndarray of I.dtype elements
"""
if not isinstance(I,numpy.ndarray) or I.ndim != 2 or not numpy.issubdtype(I.dtype,numpy.floating):
raise ValueError('Expected a 2D ndarray or float32 values')
fK = self.spectrum( I.shape )
fI = numpy.fft.fft2( I )
return numpy.real( numpy.fft.ifft2( fK*fI ) )
def correlate_fourier( self, I ):
"""Correlates the input image by the kernel in the Fourier domain
Args:
I (2d float32 array) input image to convolve with kernel
Raises:
ValueError when input is not a 2D ndarray or I.dtype is not numpy.floating subclass
Returns:
Convolution results as 2D ndarray of I.dtype elements
"""
if not isinstance(I,numpy.ndarray) or I.ndim != 2 or not numpy.issubdtype(I.dtype,numpy.floating):
raise ValueError('Expected a 2D ndarray or float32 values')
fK = self.spectrum( I.shape )
fI = numpy.fft.fft2( I )
return numpy.real( numpy.fft.ifft2( numpy.conj(fK)*fI ) )

## Testing & Validation

To ensure that indexing is done correctly, it's important to check that the kernel can represent an identity operation correctly for odd and even image sizes in both Fourier and spatial domains. After that, it's important that the convolution and correlation functions return the correct results.

This last point depends on whether the convolution or correlation filter is being defined. The difference between the two is mirroring horizontally and vertically. Personally, I find it most natural to define the correlation filter which coincides with the typical finite difference stencils. Generating some analytic data as well as some known finite difference kernels allows correct operation to be confirmed:

def TestSparseKernel2D():
"""Runs a series of test functions on the SparseKernel2D class"""
# local helper function that will raise an error if the
# results are insufficiently accurate
def MSE( x, ref ):
val = numpy.mean( numpy.power( x-ref, 2 ) )
if val > 1e-6:
raise ValueError('Values do not match!')
return val
# define a mean-squared error function
#MSE = lambda I, Ref: numpy.mean( numpy.power( I-Ref, 2 ) )
# generate an identity kernel
ident = SparseKernel2D( [0],[0],[1.0] )
# generate some test images
x,y = numpy.meshgrid( numpy.arange(0.,200.), numpy.arange(0.,100.) )
z = x+y
# check that convolving with the identity kernel
# reproduces the input for even mesh sizes
conv_z_s = ident.convolve_spatial( z )
conv_z_f = ident.convolve_fourier( z )
print('Identity test, even sizes: ',z.shape)
print('\tMSE spatial: %f'%MSE(conv_z_s,z) )
print('\tMSE fourier: %f'%MSE(conv_z_f,z) )
# check that convolving with the identity kernel
# reproduces the input for odd/even mesh sizes
tz = z[1:,:]
conv_z_s = ident.convolve_spatial( tz )
conv_z_f = ident.convolve_fourier( tz )
print('Identity test, odd/even sizes: ',tz.shape)
print('\tMSE spatial: %f'%MSE(conv_z_s,tz) )
print('\tMSE fourier: %f'%MSE(conv_z_f,tz) )
# check that convolving with the identity kernel
# reproduces the input for odd/even mesh sizes
tz = z[:,1:]
conv_z_s = ident.convolve_spatial( tz )
conv_z_f = ident.convolve_fourier( tz )
print('Identity test, even/odd sizes: ',tz.shape)
print('\tMSE spatial: %f'%MSE(conv_z_s,tz) )
print('\tMSE fourier: %f'%MSE(conv_z_f,tz) )
# check that convolving with the identity kernel
# reproduces the input for odd/odd mesh sizes
tz = z[:-1,1:]
conv_z_s = ident.convolve_spatial( tz )
conv_z_f = ident.convolve_fourier( tz )
print('Identity test, odd/odd sizes: ',tz.shape)
print('\tMSE spatial: %f'%MSE(conv_z_s,tz) )
print('\tMSE fourier: %f'%MSE(conv_z_f,tz) )
# generate first order finite difference formula
# try convolving in both the spatial and frequency
# domains
ddx = SparseKernel2D( [0,0],[-1,0],[-1.0,1.0] )
ddx_z_s = ddx.correlate_spatial( z )[1:-2,1:-2]
ddx_z_f = ddx.correlate_fourier( z )[1:-2,1:-2]
ones = numpy.ones( ddx_z_s.shape, dtype=numpy.float32 )
print('First order finite-difference d(x+y)/dx')
print('\tMSE spatial: %f' % MSE(ddx_z_s,ones) )
print('\tMSE fourier: %f' % MSE(ddx_z_f,ones) )
# generate first order finite difference formula
# try convolving in both the spatial and frequency
# domains
ddy = SparseKernel2D( [-1,0],[0,0],[-1.0,1.0] )
ddy_z_s = ddy.correlate_spatial( z )[1:-2,1:-2]
ddy_z_f = ddy.correlate_fourier( z )[1:-2,1:-2]
ones = numpy.ones( ddy_z_s.shape, dtype=numpy.float32 )
print('First order finite-difference d(x+y)/dy')
print('\tMSE spatial: %f' % MSE(ddy_z_s,ones) )
print('\tMSE fourier: %f' % MSE(ddy_z_f,ones) )
# now try the convolution functions, this simply
# flips the kernel about the central pixel
# horizontally and vertically. for the filters
# defined above, this negates the results
ddx_z_s = ddx.convolve_spatial( z )[1:-2,1:-2]
ddx_z_f = ddx.convolve_fourier( z )[1:-2,1:-2]
ones = numpy.ones( ddx_z_s.shape, dtype=numpy.float32 )
print('Convolved first order finite-difference d(x+y)/dx')
print('\tMSE spatial: %f' % MSE(ddx_z_s,-ones) )
print('\tMSE fourier: %f' % MSE(ddx_z_f,-ones) )
# generate first order finite difference formula
# try convolving in both the spatial and frequency
# domains
ddy_z_s = ddy.convolve_spatial( z )[1:-2,1:-2]
ddy_z_f = ddy.convolve_fourier( z )[1:-2,1:-2]
ones = numpy.ones( ddy_z_s.shape, dtype=numpy.float32 )
print('Convolved first order finite-difference d(x+y)/dy')
print('\tMSE spatial: %f' % MSE(ddy_z_s,-ones) )
print('\tMSE fourier: %f' % MSE(ddy_z_f,-ones) )

This prints:

Identity test, even sizes: (100, 200)
MSE spatial: 0.000000
MSE fourier: 0.000000
Identity test, odd/even sizes: (99, 200)
MSE spatial: 0.000000
MSE fourier: 0.000000
Identity test, even/odd sizes: (100, 199)
MSE spatial: 0.000000
MSE fourier: 0.000000
Identity test, odd/odd sizes: (99, 199)
MSE spatial: 0.000000
MSE fourier: 0.000000
First order finite-difference d(x+y)/dx
MSE spatial: 0.000000
MSE fourier: 0.000000
First order finite-difference d(x+y)/dy
MSE spatial: 0.000000
MSE fourier: 0.000000
Convolved first order finite-difference d(x+y)/dx
MSE spatial: 0.000000
MSE fourier: 0.000000
Convolved first order finite-difference d(x+y)/dy
MSE spatial: 0.000000
MSE fourier: 0.000000

#### Sep 08, 2018

Bilteratal filtering is a process for removing noise from images. Like most filters, each output pixel is produced as a weighted combination of input pixels. However unlike typical filters, the weights used to combine input pixels are a function of the input pixels themselves rather than fixed in advance. This makes it a non-linear filter and is key to its effectiveness.

Given an input grayscale image \({\bf I}(y,x)\) where \(x\) & \(y\) are column and row indices, bilateral filtering defines the output image \({\bf F}\) as:

$$
{\bf F}(y,x) = \frac{ \sum_{r=y-w}^{y+w} \sum_{s=x-w}^{x+w} {\bf I}(r,s) \mathcal{N}_s(r-y,s-x) \mathcal{N}_v \left( {\bf I}(y,x)-{\bf I}(s,r) \right) }{ \sum_{r=y-w}^{y+w} \sum_{s=x-w}^{x+w} \mathcal{N}_s(r-y,s-x) \mathcal{N}_v \left( {\bf I}(y,x)-{\bf I}(s,r) \right)}
$$

This looks more involved than it actually is:

- The numerator and denominator both sum over a window that is \(2w+1 \times 2w+1\) pixels in size, centered on \(y\),\(x\). The pixel coordinates \(r\),\(s\) reference each pixel within that window.
- The pixel values of the input are combined via a weighted sum using the
*spatial proximity* in \(\mathcal{N}_s(\delta_y,\delta_x) = e^{ -\frac{\delta_x^2+\delta_y^2}{2\sigma_s^2} }\), as in standard Gaussian filtering.
- An additional factor based on the
*value proximity* \(\mathcal{N}_v \left( v \right) = e^{ -\frac{v^2}{2 \sigma_v^2} }\) is used to modify the weights. This is the nonlinear part of the filter and causes pixels with similar value to the central pixel to be weighted more strongly.
- The entire sum is normalized by the sum of the product of the
*spatial proximity* and *value proximity* weights in order to produce an output with proper range.

The mathematical definition is good for understanding but a straightforward implementation results in four nested loops: two for the rows/columns and two for the offsets in each window. In pure python this leads to considerable overhead. A better approach is to vectorize the operations by:

- Only looping over the offsets for the window
- Generating the
*spatial proximity* weights once per offset
- Shifting the entire input image by these offsets
- Generating the
*value proximity* weights in builk using the shifted image
- Accumulating an unnormalized result image and sum of weights image in bulk
- Returning the pixel-wise quotient of the result and weight sum images

Code that implements this is below:

def filter_bilateral( img_in, sigma_s, sigma_v, reg_constant=1e-8 ):
"""Simple bilateral filtering of an input image
Performs standard bilateral filtering of an input image. If padding is desired,
img_in should be padded prior to calling
Args:
img_in (ndarray) monochrome input image
sigma_s (float) spatial gaussian std. dev.
sigma_v (float) value gaussian std. dev.
reg_constant (float) optional regularization constant for pathalogical cases
Returns:
result (ndarray) output bilateral-filtered image
Raises:
ValueError whenever img_in is not a 2D float32 valued numpy.ndarray
"""
# check the input
if not isinstance( img_in, numpy.ndarray ) or img_in.dtype != 'float32' or img_in.ndim != 2:
raise ValueError('Expected a 2D numpy.ndarray with float32 elements')
# make a simple Gaussian function taking the squared radius
gaussian = lambda r2, sigma: (numpy.exp( -0.5*r2/sigma**2 )*3).astype(int)*1.0/3.0
# define the window width to be the 3 time the spatial std. dev. to
# be sure that most of the spatial kernel is actually captured
win_width = int( 3*sigma_s+1 )
# initialize the results and sum of weights to very small values for
# numerical stability. not strictly necessary but helpful to avoid
# wild values with pathological choices of parameters
wgt_sum = numpy.ones( img_in.shape )*reg_constant
result = img_in*reg_constant
# accumulate the result by circularly shifting the image across the
# window in the horizontal and vertical directions. within the inner
# loop, calculate the two weights and accumulate the weight sum and
# the unnormalized result image
for shft_x in range(-win_width,win_width+1):
for shft_y in range(-win_width,win_width+1):
# compute the spatial weight
w = gaussian( shft_x**2+shft_y**2, sigma_s )
# shift by the offsets
off = numpy.roll(img_in, [shft_y, shft_x], axis=[0,1] )
# compute the value weight
tw = w*gaussian( (off-img_in)**2, sigma_v )
# accumulate the results
result += off*tw
wgt_sum += tw
# normalize the result and return
return result/wgt_sum

An example of calling the function is below:

# use opencv for image loading
import cv2
import numpy
# function definition here....
# read the lena image, convert to float and scale to [0,1]
I = cv2.imread('images/lena.png', cv2.IMREAD_UNCHANGED ).astype(numpy.float32)/255.0
# bilateral filter the image
B = numpy.stack([
filter_bilateral( I[:,:,0], 10.0, 0.1 ),
filter_bilateral( I[:,:,1], 10.0, 0.1 ),
filter_bilateral( I[:,:,2], 10.0, 0.1 )], axis=2 )
# stack the images horizontally
O = numpy.hstack( [I,B] )
# write out the image
cv2.imwrite( 'images/lena_bilateral.png', out*255.0 )

Which produces the following, the input image is one the left and the output image is on the right.

The only downside to this is the speed. Evaluating the *value proximity* weights is costly. It can be sped up by discretizing and interpolating them from a lookup table. Since the filter also preserves edges very well, there is often the temptation to use very large spatial filter sizes. In the example above I used a value of 10 which corresponds to a **61x61** pixel kernel. This makes the filter a prime candidate to implement in C++ or CUDA for efficiency.