# James Gregson's Website

## Deconvolution via ADMM in Python - Part 1: Math

Deconvolution is the process of removing blur from a signal. For images, there are two main types:

• Blind deconvolution removes blur without specific knowledge of what the exact blur is. It generally requires estimating the blur kernel.
• Non-blind deconvolution removes blur but is provided the blur kernel directly. It is an easier problem than blind deconvolution. Non-blind deconvolution is often used as a component of blind deconvolution algorithms.

Here I am focusing on non-blind deconvolution.

## Image Formation Model

When the blur kernel is the same over the entire image, blurry images can be modeled mathematically as a convolution of the blur kernel with an unknown sharp image, i.e.:

\begin{equation*} {\bf B} = {\bf K} \otimes {\bf I} + \epsilon \end{equation*}

The blur kernel $${\bf K}$$ is the image that is produced in the blurry image by a single point (pixel) of the unknown sharp image $${\bf I}$$. It's often called the point spread function and, with the blurry image $${\bf B}$$, is an input to the deconvolution algorithm. The last term $$\epsilon$$ models noise in the captured signal and is often assumed to be Gaussian, although in practice is often not.

The convolution operation is linear, so in an ideal world we would just apply it's inverse to recover the sharp image. What makes deconvolution challenging is that the blur kernel usually destroys at least some information. This means that the inverse does not exist. Consequently, solving a deconvolution problems actually means finding one of the infinitely many possible sharp images that, convolved with the blur kernel, would reproduce the measured blurry image.

## Optimizing for $${\bf I}$$

Making the typical assumption of $$\epsilon$$ being Gaussian noise, a solution for the sharp image $${\bf I}$$ can be found by looking for an underlying sharp image that best explains, in a least-squares sense, the measured data $${\bf B}$$.

\begin{equation*} {\bf I} = \mbox{argmin}_{\bf I} \frac{1}{2} \| {\bf K} \otimes {\bf I} - {\bf B} \|_2^2 \end{equation*}

Denoting $$\mathcal{F}(\bf K)$$ and $$\mathcal{F}^{-1}(\bf K)$$ as the forward and inverse Fourier transform of $${\bf K}$$ and $$*$$ as complex conjugation, this equation can be solved very quickly in the Fourier domain:

\begin{equation*} {\bf I} = \mathcal{F}^{-1}\left[ \frac{\mathcal{F}({\bf K})^*\mathcal{F}(\bf B)}{\mathcal{F}({\bf K})^2} \right] \end{equation*}

Unfortunately the results are garbage. Zero or very low magnitude values of the Fourier transform of $${\bf K}$$ cause divide-by-zero issues and amplify noise. These tiny values are due to convolution by $${\bf K}$$ attenuating certain frequency bands to nothing. It makes sense intuitively since blurry images lose the sharp edges and fine details of focused images so the Fourier modes representing these are lost forever. Trying to undo the process is poorly defined and so the solve above just finds anything that minimizes the objective, whether it looks like an image or not.

It's possible to stabilize the above a bit by adding fudge factors here and there but the results are not very compelling. A better approach is to add regularizers to the optimization that favor specific types of underlying sharp images. One very effective set is the total variation which is incorporated below:

\begin{equation*} {\bf I} = \mbox{argmin}_{\bf I} \frac{1}{2} \| {\bf K} \otimes {\bf I} - {\bf B} \|_2^2 + \gamma \| \frac{\partial}{\partial x} {\bf I} \|_1 + \gamma \| \frac{\partial}{\partial y} {\bf I} \|_1 \end{equation*}

The two extra terms penalize the gradients of the image and reach a zero value only for images that are constant. Using a 1-norm makes them less sensitive to large jumps in value than a 2-norm would be which makes these priors favor piecewise constant reconstructions. This is qualitatively similar to natural images that feature piecewise smooth features.

The only problem is that these 1-norm terms make the optimization much more difficult and costly to solve. However, the effort is worth it.

## Formulating the Optimization with ADMM

The problem with solving the optimization above is that the 1-norm terms are non-smooth. This is challenging for optimization methods. Fortunately, there are some very effective tools for solving 1-norm problems in isolation and a framework call the Alternating Direction Method of Multipliers can be used to transform intractible monolithic objective functions (such as the one above) into several coupled simpler problems.

By being careful about which terms go where, this can result in very efficient solves for problems involving complex priors. What follows is only one option for the above problem.

The overall goal is to split the $$L_1$$ solves into separate pixel-wise solves. This is done by introducing new variables and constraints that make the problem look more complex but actually help. The data term will stay as a coupled $$L_2$$ problem and will absorb the coupling terms as well (which are also $$L_2$$) and serve to stabilize the solve.

The first step is to isolate the $$L_1$$ terms. The differential operators in the priors have the effect of coupling all pixels to each other, but the fast solvers for $$L_1$$ problems expect terms that look like $$\|{\bf A}\|_1$$ for some image $${\bf A}$$. This can be accomplished by introducing splitting variables $${\bf Z}_x,{\bf Z}_y$$ with appropriate constraints.

\begin{align*} {\bf I} = \mbox{argmin} & & \frac{1}{2} \| {\bf K} \otimes {\bf I} - {\bf B} \|_2^2 + \gamma \| {\bf Z}_x \|_1 + \gamma \| {\bf Z}_y \|_1 \\ \mbox{subject to} & & \frac{\partial}{\partial x} {\bf I} - {\bf Z}_x = 0 \\ & & \frac{\partial}{\partial y} {\bf I} - {\bf Z}_y = 0 \end{align*}

It is also important to make sure that the $${\bf Z}_x, {\bf Z}_y$$ terms appear without complex linear operators in the constraints. This ensures that the terms will diagonalize to a large set of pixelwise 1D optimizations rather than couple together.

With the splitting variables introduced, the constraints can be incorporated into the objective via penalty terms with (scaled) Lagrange multipliers $${\bf U}_x,{\bf U}_y$$. The multipliers keep the constraints stiff. Without the multipliers, the high weights needed on the penalty terms to enforce the constraints can dominate the solves.

Applying this step transform back from a constrained problem to a (much more complicated) unconstrained problem again:

\begin{equation*} {\bf I} = \mbox{argmin} \frac{1}{2} \| {\bf K} \otimes {\bf I} - {\bf B} \|_2^2 + \frac{\rho}{2} \| \frac{\partial}{\partial x} {\bf I} - {\bf Z}_x + {\bf U}_x \|_2^2 + \frac{\rho}{2} \| \frac{\partial}{\partial y} {\bf I} - {\bf Z}_y + {\bf U}_y \|_2^2 + \gamma \| {\bf Z}_x \|_1 + \gamma \| {\bf Z}_y \|_1 \end{equation*}

This looks like a mess, but is just the original data term, the two added $$L_1$$ terms and two additional $$L_2$$ coupling terms. Calling the whole thing $$F({\bf I},{\bf Z}_x, {\bf Z}_y, {\bf U}_x, {\bf U}_y)$$ the ADMM algorithm solves this by the following steps:

• Updates $${\bf I}$$ by solving $${\bf I} = \mbox{argmin}_{\bf I} F({\bf I},{\bf Z}_x, {\bf Z}_y, {\bf U}_x, {\bf U}_y)$$
• Updates $${\bf Z}_x$$ by solving $${\bf Z}_x = \mbox{argmin}_{{\bf Z}_x} F({\bf I},{\bf Z}_x, {\bf Z}_y, {\bf U}_x, {\bf U}_y)$$
• Updates $${\bf U}_x$$ by setting $${\bf U}_x = {\bf U}_x + \frac{\partial}{\partial x} {\bf I} - {\bf Z}_x$$
• Repeats the last two steps, appropriately modified, for $${\bf Z}_y$$ and $${\bf U}_y$$ respectively.

### The $${\bf I}$$ Update

The first step of ADMM is to update $${\bf I}$$. This is a straightforward $$L_2$$ problem since the only terms that $${\bf I}$$ appears is in terms with a 2-norm. Denoting $$\frac{\partial}{\partial x}{\bf I}$$ as $${\bf D}_x \otimes {\bf I}$$ and likewise for $$\frac{\partial}{\partial y}$$, the solve ends up being:

\begin{equation*} {\bf I} = \mbox{argmin} \frac{1}{2} \| {\bf K} \otimes {\bf I} - {\bf B} \|_2^2 + \frac{\rho}{2} \| {\bf D}_x \otimes {\bf I} - {\bf Z}_x + {\bf U}_x \|_2^2 + \frac{\rho}{2} \| {\bf D}_y \otimes {\bf I} - {\bf Z}_y + {\bf U}_y \|_2^2 \end{equation*}

All the terms are effectively the same: a minimization over the residual of a kernel convolved the the target image with respect to a measured value. The minimum is found by expanding out the term and finding the target image that sets the gradient of the expanded term to zero.

The solution for the first term in the Fourier domain is earlier in the post as:

\begin{equation*} {\bf I} = \mathcal{F}^{-1}\left[ \frac{\mathcal{F}({\bf K})^*\mathcal{F}(\bf B)}{\mathcal{F}({\bf K})^2} \right] \end{equation*}

The remaining terms follow the same pattern and sum in the numerator and denominator of the solve:

\begin{equation*} {\bf I} = \mathcal{F}^{-1}\left[ \frac{ \mathcal{F}({\bf K})^*\mathcal{F}(\bf B) +\rho \mathcal{F}({\bf D}_x)^*\mathcal{F}({\bf Z}_x - {\bf U}_x) +\rho \mathcal{F}({\bf D}_y)^*\mathcal{F}({\bf Z}_y - {\bf U}_y) }{\mathcal{F}({\bf K})^2 + \rho \mathcal{F}({\bf D}_x)^2 + \rho \mathcal{F}({\bf D}_y)^2} \right] \end{equation*}

This involves 9 Fourier transforms, but only the terms containing $${\bf Z}_x,{\bf U}_x$$ actually change during the solve (similarly for the y subscripts). By caching the first term of the numerator, the entire denominator, as well as $$\mathcal{F}({\bf D}_x)\mbox{ & }\mathcal{F}({\bf D}_y)$$ this can be reduced to three transforms. It can further be reduced by evaluating the derivative filters $${\bf D}_x\mbox{ & }{\bf D}_y$$ in the spatial domain, since they are very inexpensive finite difference filters. The following gets the $${\bf I}$$ update down to one forward and one inverse transform once contants are cached:

\begin{align*} {\bf V} = \rho \left( {\bf D}^T_x \otimes ({\bf Z}_x - {\bf U}_x) + {\bf D}^T_y \otimes ({\bf Z}_y - {\bf U}_y)\right) \\ {\bf I} = \mathcal{F}^{-1}\left[ \frac{ \mathcal{F}({\bf K})^*\mathcal{F}(\bf B) + \mathcal{F}({\bf V}) }{\mathcal{F}({\bf K})^2 + \rho \mathcal{F}({\bf D}_x)^2 + \rho \mathcal{F}({\bf D}_y)^2} \right] \end{align*}

Important note: the $${\bf D}^T_x \otimes$$ terms in $${\bf V}$$ use the transposed filters (mixing notation terribly). This is the complex conjugate of the filter in the Fourier domain version, or matrix transpose if the convolution operation were represented in matrix form. Expressed as spatial domain convolution, the transpose simply requires mirroring the kernels horizontally and vertically about the central pixel.

### The $${\bf Z}$$ Updates

The next step of ADMM is to update the splitting variable $${\bf Z}_x\mbox{ & }{\bf Z}_y$$. These terms only appear in one coupling terms and one $$L_1$$ term each. The updates for the two $${\bf Z}$$ and two $${\bf U}$$ variables are the same except the specific filter that is used, so I'll only cover one. Here's the update needed for $${\bf Z}_x$$:

\begin{equation*} {\bf Z}_x = \mbox{argmin}_{{\bf Z}_x} \frac{\rho}{2} \| {\bf D}_x \otimes {\bf I} - {\bf Z}_x + {\bf U}_x \|_2^2 + \gamma \| {\bf Z}_x \|_1 \end{equation*}

It's important to note that there are no operators applied to $${\bf Z}_x$$, meaning that each pixel can be processed independently. The terms can be rearranged and the weight factors collected onto the $$L_1$$ term to give the equivalent minimization:

\begin{equation*} {\bf Z}_x = \mbox{argmin}_{{\bf Z}_x} \frac{\gamma}{\rho} \| {\bf Z}_x \|_1 + \frac{1}{2} \| {\bf Z}_x - ({\bf D}_x \otimes {\bf I} + {\bf U}_x ) \|_2^2 \end{equation*}

However, a complication is (still) the $$L_1$$ term. This is non-smooth and tricky to deal with but the form above allows a non-smooth optimization method known as proximal operators to be used. The proximal operator of a scaled function $$\lambda f(q)$$ is defined as:

\begin{equation*} \mbox{prox}_{\lambda f}({\bf q}) = \mbox{argmin}_{\bf p} f({\bf p}) + \frac{1}{2\lambda} \| {\bf p} - {\bf q} \|_2^2 \end{equation*}

The similarities with the $${\bf Z}_x$$ minimization are strong. By setting $$f({\bf Z}_x)=\|{\bf Z}_x\|_1\mbox{ & }\lambda = \frac{\gamma}{\rho}\mbox{ & }{\bf q} = {\bf D}_x \otimes {\bf I} + {\bf U}_x$$ the proximal operator is recovered. This is great because the closed form solution for $$\mbox{prox}_{\lambda\|{\bf p}\|_1}(\bf{q})$$ is known and trivial to implement:

\begin{equation*} \mbox{prox}_{\lambda\|{\bf p}\|_1}({\bf q}) = \mbox{sign}({\bf q}) \mbox{max}( 0, |{\bf q}| - \lambda) \end{equation*}

This is known as the soft-thresholding operator, or $$L_1$$ shrinkage operator and is used everywhere. It can be implemented in python as:

def soft_threshold( q, lam ):
return numpy.sign(q)*numpy.maximum( numpy.abs(q) - lam, 0.0 )


In my view, it's the most potent line of code that exists in signal processing.

### The $${\bf U}$$ Updates

The $${\bf U}$$ updates simply add the current constraint violation to the Lagrange multipliers $${\bf U}$$. It's kind of like the integral term in a PID controller.

The update was actually given already in the overview above but is repeated here for completeness:

\begin{align*} {\bf U}_x = {\bf U}_x + {\bf D}_x \otimes {\bf I} - {\bf Z}_x \\ {\bf U}_y = {\bf U}_y + {\bf D}_x \otimes {\bf I} - {\bf Z}_y \end{align*}

This is identical to the formula ealier, I've just used the $${\bf D}_x \otimes {\bf I}$$ notation rather than $$\frac{\partial}{\partial x} {\bf I}$$.

## Summary

That's all the math that's needed. It looks like a lot but ends up being surprisingly little code. In the next part I work through the implementation.