[Auto]Stitching Photo Mosaics

CS 194-26 Project #7 · Owen Jow · November 2016

[00] Setting the Scene

For this project, we produce seamlessly stitched mosaics by shooting overlapped photographs with an identical center of projection, computing projective transformations between those photographs, warping them toward whichever is selected as a base image, and finally blending all of the processed photographs into one big panorama.

To begin, we shoot pictures that are related by a projective transformation (by taking pictures from the same point of view but with different view directions). We are also careful to ensure that consecutive photographs have some common content, because in order to recover the transform between two images we have to select corresponding points in the images' overlapping scenes.

[01] Feature Detection

In selecting corresponding points, we have two options: manually click on high-profile feature points within the overlap across each of the two images, or let the computer find all the features itself. The first option is a ginput call – we have to do all the thinking, so it's easy for the computer and hard for us. On the other hand, the second option is hard for the computer and easy for us. We obviously invented computers so we could enslave them and make them do all the difficult things in life, so everyone's preference here should be a no-brainer.

Unfortunately, we still have to give it some orders. How should our feature points be selected? In this case, we'll use the Harris interest point detector to identify decent features (i.e. Harris points) in an image. A Harris point is a fancy name for a corner, a location in the image with a large spatial gradient in all directions, found via Harris's algorithm. Several reduced-size sets of Harris points can be seen below.

5000 / 235746 Harris points (randomly selected) 1000 / 238458 Harris points

1000 / 159983 Harris points

Note that the above points are only random subsets of all of the Harris points. The corner detector is coming up with ~200k points per image! That's just excessive. In order to trim the size, we'll (a) filter the Harris points by using corner_peaks instead of peak_local_max and (b) utilize adaptive non-maximal suppression, or ANMS, as described in "Multi-Image Matching using Multi-Scale Oriented Patches" by Brown et al. ANMS aims to select some number of points across the image that are both spaced out and locally maximized in terms of "corner strength." It does so by first sorting points according to their minimum suppression radius $r_i$, where

$$r_i = \min_{j}|\textbf{x}_i - \textbf{x}_j|, \text{ s.t. } \, h(\textbf{x}_i) < c_{robust}h(\textbf{x}_j), \textbf{x}_j \in I$$

($\textbf{x}_i$ is the Harris point in question, $I$ is the set of all Harris points, $c_{robust} = 0.9$ is a constant determining the level of suppression, and $h(\textbf{x})$ is the corner strength of point $\textbf{x}$.)

After the points are sorted, we take the maximum ~500 points and use those as our ANM-suppressed selection. Intuitively, this gives us the strongest point in each region because it maximizes the minimum spatial distance from every point to a stronger point.

From this moment onward, we mostly use corner_peaks to compute Harris points.

corner_peaks output (17767 points) 500 ANMS-filtered

500 highest strength 500 lowest from all ~200k

corner_peaks output (3944 points) 500 ANMS-filtered

500 highest strength 500 lowest from all ~200k

[02] Descriptor Extraction and Point Matching

After we obtain our feature points, we have to match them. Our plan of attack: extract feature descriptors for each point and then match those instead. Said descriptors characterize the immediate region around every feature point. Accordingly, every such descriptor will be represented for us as a normalized local patch of size 8 x 8 – downsampled, that is, from 40 x 40. We make our descriptors invariant under rotation in order to increase robustness.

To reiterate, these are the steps we take to generate our feature descriptors:

                    for every ANMS point:

                        grab the ~60 x ~60 grayscale window around the point

                        compute the avg gradient at the point in the x- and y- directions

                        rotate the window by the negative atan of y-gradient / x-gradient

                        extract the central 40 x 40 window from the rotated patch

                        blur it with a Gaussian filter

                        downsize the patch so that it's 8 x 8

                        bias/gain-normalize the patch using I' = (I - μ) / σ

We rotate in order to standardize the direction of the gradients, making the descriptors invariant under rotation. If two descriptors contain the same content, but one is rotated, this transform should cause both of them to face the same way.

ANMS points (500 in total) Descriptors upscaled to 80 x 80

Once we have the descriptors for each image – maybe it hasn't been said, but we're only stitching together two images at any given time – we can match them. We can match them by finding, for every descriptor in the first image, the most similar descriptor in the second image (aka the "nearest neighbor" / NN). Similarity is defined by a sum of squared differences (SSD) function, over which we minimize while seeking resemblance.

Since not every pair of features will fit the model – and since in our case, having no match is much better than having a bad match! – we'll only pair the points we're sufficiently confident about. This is accomplished by following Lowe, or in other words imposing a threshold on the ratio between the error for the 1-NN match and the error for the 2-NN match. We want this ratio to be minimal, such that the error for the 1-NN is small and the error for the 2-NN is large (indicating, as it should be, that one match is clearly correct and everything else is clearly not). Based on the plot below (screenshotted from the MOPS paper linked previously) we shouldn't be using matches for which $(error_{1-NN}) / (error_{2-NN}) > 0.66$ at the very least. However, wanting to be really confident in our pairings, we set the threshold at 0.3 for safety's sake. When we get bad matches anyway, we set it to 0.1.

If the ratio exceeds 0.66, a match is more likely to be incorrect than correct

This was the only thing in the project that worked on the first try. I was hyped.

im1 (30 matches) im2 (30 matches)

[03] RANSAC for Homography Estimation

It looks good, but we're not necessarily out of the tunnel yet. We need to be dead sure that there are no homography-ruining outliers in our set of feature matches. (By "dead sure," I actually mean "almost" dead sure because nothing in life is certain.)

So we use RANSAC. RANSAC, or RANdom SAmple Consensus, is a technique for estimating the homography matrix (math for that below) while avoiding feature-space outliers. How does it do this? The steps are as follows (all of these actions are performed within a loop that runs for 5000 iterations):

Select four pairs of matching features at random
Calculate the homography matrix $H$ for the chosen points
Apply the homography to all of the feature points in one image
Count the total number of inliers, i.e. points where $SSD(\textbf{p}_i', H\textbf{p}_i) < \epsilon$
Track the largest set of inliers yielded by any homography, and treat those as THE inliers

At this point, we compute the homography over all of the inliers and that's it! That's our homography.

[04] Solving for the Homography Matrix

In order to identify the transformation between any two of our images, we need to compute the homography matrix $H$. We can relate a pair of corresponding points through the equation $\textbf{p}' = H\textbf{p}$, where $\textbf{p}$ is an $\begin{bmatrix}x & y & 1\end{bmatrix}^T$ vector representing a point in the first image and $\textbf{p}'$ is a $\begin{bmatrix}wx' & wy' & w\end{bmatrix}^T$ representing a scaled point in the second image. $H$, incidentally, is what we're solving for and exists as a 3 x 3 matrix with eight degrees of freedom.

$$ \begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} $$

In other words, we need to obtain values for $a, b, ..., h$ given a set of $(x, y)$ and $(x', y')$ points. To do this, we flatten $H$ into a vector $\textbf{h}$ containing the eight unknowns and attempt to set up the matrix equation $A\textbf{h} = \textbf{b}$. After setting $b$ to be the vector of all the $(x', y')$ points, we are left to derive only the entries of $A$. Accordingly, we focus on relating our givens – the $(x, y)$ and $(x', y')$ coordinates – using $a, b, ..., h$.

It follows from the original homography equation that

$$ ax + by + c = wx' \\ dx + ey + f = wy' \\ gx + hy + 1 = w $$

We can easily rearrange these equations into equalities for $x'$ and $y'$:

$$ \rightarrow \: ax + by + c = (gx + hy + 1)x' \\ ax + by + c - gxx' - hx'y = x' \\~\\ \rightarrow \: dx + ey + f = (gx + hy + 1)y' \\ dx + ey + f - gxy' - hyy' = y' $$

At this point, we can rewrite our equalities as a matrix equation that's more or less in the $A\textbf{h} = \textbf{b}$ form we're looking for.

$$ \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -xx' & -x'y \\ 0 & 0 & 0 & x & y & 1 & -xy' & -yy' \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} x' \\ y' \end{bmatrix} $$

Then, through inspection of the above relationship, we are able to reach our final system of $2n$ equations. ($n$ is the number of point correspondences between the two images). Since we have eight unknowns, we need $n \geq 4$ in order to successfully solve for $\textbf{h}$. Meanwhile, for stability's sake we will ideally have more than four correspondences (i.e. an overdetermined linear system of equations), so we solve for $\textbf{h}$ using least-squares.

$$ \begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1x_1' & -x_1'y_1 \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1y_1' & -y_1y_1' \\ x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2x_2' & -x_2'y_2 \\ 0 & 0 & 0 & x_2 & y_2 & 1 & -x_2y_2' & -y_2y_2' \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_n & y_n & 1 & 0 & 0 & 0 & -x_nx_n' & -x_n'y_n \\ 0 & 0 & 0 & x_n & y_n & 1 & -x_ny_n' & -y_ny_n' \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} x_1' \\ y_1' \\ x_2' \\ y_2' \\ \vdots \\ x_n' \\ y_n' \end{bmatrix} $$

Dimensionally speaking, we have (a $2n$ x 8 matrix) $\cdot$ (an 8 x 1 vector) = (a $2n$ x 1 vector). After we've solved for $\textbf{h}$, we can turn it into a 9 x 1 vector by appending a 1 (that being the scaling factor) and reshape it into a 3 x 3 matrix to serve as our final homography transformation. Recall that this matrix can be used to warp points in the first image into the second image's coordinate system.

[05] Applying Our Homography

Let's go into a little more detail on that front. We're warping an image, which means we're taking the colors that were already in the image and moving them to different places.

However, in our case we actually want to make use of an inverse warp. An inverse warp asks, for every pixel location in the output shape, "which color from the input image should go here?" The answer to that is usually a combination of input colors, so we bilinearly interpolate across these in order to make sure we don't miss anything.

Mathematically, of course, it's our homography that tells us the locations of the input colors to use. To derive the inverse warp, we begin again from our initial equation (but en masse): $P' = HP$. This time, $P$ is a 3 x $n$ matrix containing all of the points from the input image (the one we're warping; the one whose colors we're moving around) in homogeneous coordinates. Likewise, $P'$ is a 3 x $n$ matrix containing scaled points from the image whose coordinate system we're warping to. $H$, as before, is the homography matrix that converts $P$'s spatial domain into $P'$'s spatial domain.

$$ P' = HP \\ \begin{bmatrix} w_1x_1' & w_2x_2' & \cdots & w_nx_n' \\ w_1y_1' & w_2y_2' & \cdots & w_ny_n' \\ w_1 & w_2 & \cdots & w_n \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ 1 & 1 & \cdots & 1 \end{bmatrix} $$

We want the inverse warp, though, so we look for $P$ in terms of $P'$ – the input location corresponding to every output location. Once we have these, we can look up the colors around each of these input locations ("around" because the locations might be between pixels) and interpolate.

$$ P = H^{-1}P' \\ \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ 1 & 1 & \cdots & 1 \end{bmatrix} = H^{-1} \begin{bmatrix} w_1x_1' & w_2x_2' & \cdots & w_nx_n' \\ w_1y_1' & w_2y_2' & \cdots & w_ny_n' \\ w_1 & w_2 & \cdots & w_n \end{bmatrix} $$

Equivalently (if, for example, a certain SciPy function requires that the image points be row vectors), we can use transpose properties to define

$$ \begin{bmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ \vdots & \vdots & \vdots \\ x_n & y_n & 1 \end{bmatrix} = \begin{bmatrix} w_1x_1' & w_1y_1' & w_1 \\ w_2x_2' & w_2y_2' & w_2 \\ \vdots & \vdots & \vdots \\ w_nx_n' & w_ny_n' & w_n \end{bmatrix} (H^{-1})^T $$

Realistically, however, we won't know the $w_i$ values ahead of time (for $i \in [1 ... n]$). What we'll actually compute is:

$$ \begin{bmatrix} v_1x_1 & v_1y_1 & v_1 \\ v_2x_2 & v_2y_2 & v_2 \\ \vdots & \vdots & \vdots \\ v_nx_n & v_ny_n & v_n \end{bmatrix} = \begin{bmatrix} x_1' & y_1' & 1 \\ x_2' & y_2' & 1 \\ \vdots & \vdots & \vdots \\ x_n' & y_n' & 1 \end{bmatrix} (H^{-1})^T $$

This isn't a problem, since we can simply renormalize each of the row vectors until the third coordinate is always 1. Anyway (as mentioned like three times), we can use the input image locations to complete our inverse warp via color interpolation! At this juncture, we're fully able to warp an image based on a pointwise projective transformation.

[06] Rectifying Images

Enough math. Let's take what we've done so far and produce something visual! As a test of projective warp correctness, we attempt to rectify a couple of images. Specifically, we'll try to set up our warps so that certain planar surfaces are frontal-parallel in the output image.

In the examples below, we select four input points over the corners of some planar surface, and then use the rectified version of those points as the corresponding $P'$ coordinates.

Sunset view from my apartment (11/11/16) Rectified using sq. window as a basis

Assortment of rectangular objects Rectified over Family Life

[07] Producing Panoramic Mosaics

Here, we warp and blend a sequence of overlapping images together – creating, in expectation, a nicely stitched panorama. Given a set of photographs that don't span more than 180 degrees (faraway scenes work well, since we can then project everything onto a plane at infinity), we start by selecting a centrally positioned photograph to serve as the base panorama image. We then project and blend all other images one-by-one into this photograph's coordinate system, amassing a holistic panorama as we go.

That's pretty much it!

Since individual images are projected directly onto their correct coordinates in the overall panorama, we don't need to align anything after a warp. All we need to do is blend, which we handle through weighted averaging at every pixel. To facilitate this weighting, we maintain alpha values for each image's pixels that represent how far every pixel is from the unwarped image's center column, i.e. how important the pixel is to its source. We assign the alpha value for the center column to 1, and let this value fall off linearly as it approaches the sides (such that it is 0 at the edges).

Then, after warping, we compute the output pixels specific to the individual image, the output pixels specific to the growing panorama image, and the output pixels at which there is overlap. For the non-overlap pixels, we use the unmodified RGB values from their source images. For the overlap pixels, we normalize the alpha values so that they sum to 1 and then linearly interpolate between the color values of each image (using the normalized alphas as weights). Finally, we set the entire alpha channel to 1 in the output panorama at each stage in order to get rid of any residual fading effects.

Below we have some results. The photographs are from me biking up to Indian Rock and taking a bunch of pictures of the distance; they yielded some decent mosaics in my opinion (especially after cropping!). [Part B edit: The automatically stitched images have now been included above each of their manual counterparts.]

Cropped view from a rock, pt. 1 (auto)

View from a rock, pt. 1 (auto)

Cropped view from a rock, pt. 1 (manual)

View from a rock, pt. 1 (manual)

Left image Center image Right image

The second panorama also turned out all right. However, success was tempered by the fact that I – being an unprofessional panorama photographer – managed to capture a bunch of pesky and ephemerally present humans across the bottom of my photographs. Even after being blended into the final image, their ghosts can still be observed.

Cropped view from a rock, pt. 2 (auto)

View from a rock, pt. 2 (auto)

Cropped view from a rock, pt. 2 (manual)

View from a rock, pt. 2 (manual)

Left image Center image Right image

Technically speaking, my third panorama was the least successful because it contained a lot of moving elements (in the form of automobiles) and also spanned more than 180 degrees of view. You can see from the top right corner that the third image's projection is beginning to warp out of control! You might also notice a car without its shadow, and a shadow without its car.

I think I also messed up somewhat on keeping my center of projection constant.

"The Circle", cropped (auto)

It's up north somewhere (auto)

"The Circle", cropped (manual)

Manual feature selection

Left image Center image Right image

Two other panoramas I enjoyed (both were stitched automatically):

[08] Bells and Whistles: Cylindrical Mapping

For non-planar (read: wide) fields of view, we'll want to utilize a cylindrical projection. Although this distorts the edges of the image, it's a worthwhile tradeoff when we're looking for a panorama with an exceedingly large horizontal dimension. As an added requirement, however, we need our input images to be horizontally aligned from the get-go; we're taking the homography out of the mix here.

This time, we compute the inverse warping as follows:

$$ \begin{array}{c c c} \theta = (x' - x_c) / f & \:\:\:\:\: \hat{x} = \sin{\theta} \:\:\:\:\: & x = f\hat{x} / \hat{z} + x_c \\ h = (y' - y_c) / f & \:\:\:\:\: \hat{y} = h \:\:\:\:\: & y = f\hat{y} / \hat{z} + y_c \\ & \:\:\:\:\: \hat{z} = \cos{\theta} \:\:\:\:\: & \end{array} $$

In practice, this can be condensed to

$$ \begin{array}{c} x = f\tan{((x' - x_c) / f)} \\ y = (y' - y_c) / \cos{((x' - x_c) / f)} \end{array} $$

$f$ is the focal length in pixels, which can be computed from the EXIF focal length and the camera CCD width (which is 6.08mm on my point-and-shoot) as specified here. $x$ and $y$ are the coordinates in the source image, while $x_c$ and $y_c$ are the center coordinates of the unwrapped output image. Again, we're given $x', y',$ and $f$ and we want to determine the source coordinates $(x, y)$ from which to interpolate colors.

Once we have our cylindrically warped images, the only thing we still need to do is compute the translation between adjacent components. (Note: it's merely a translation because we're essentially just sliding images back and forth along the lengthwise axis of the unwrapped cylinder.) Under the assumption that we're only stitching in the x-direction, we use as our translation the average horizontal difference between corresponding feature points. To obtain these feature points, we execute the same descriptor-matching procedure as before. We also reject outliers from the resulting set of differences before taking the mean.

Input image Cylindrically warped output

In order to generate a panorama with a cylindrical mapping scheme, we run two stages. The first stage involves warping all of the input images; the second involves aligning them and blending them all together. During the second stage, we loop in order to stitch the images one-by-one into a constantly growing base image.

Here we have the output of the first stage.

1 2 3 4 5 6 7

As an example, we can automatically combine the cylindrically warped images above into a single panorama. Regrettably, our panorama's most salient feature – the human subject's face – is slightly blurred. I'd like to hastily assign blame to said human subject for moving around between camera shots. (At the same time, he did an admirable job of keeping mostly still.)

[09] Bells and Whistles: 360° Panorama

Using our cylindrical warping technique, we can now put together a panorama spanning 360 degrees of view. We do this by using the same stitching algorithm as that of the previous section, while also making sure to merge the final image and the initial image at the end (so as to create a "true" 360 experience).

The input images and final panorama can be seen below. Alignment wasn't perfect, but in examining the source images I've come to the conclusion that a perfect translational x-alignment doesn't exist anyway. To deal with the misalignment we'd probably have to do something more sophisticated (or take better pictures).

1 2 3 4 5 6 7 8 9

We also have an interactive version (courtesy Pannellum; they have some pretty high-quality demos if you want to see some real/spherical panoramas!). Click and drag in order to pan around the scene. You can also zoom in/out if you so desire. If the viewer doesn't work – although hopefully it will for a very long time – it probably means you're from the distant future and my Imgur source / Pannellum's CDNs have gone down. Or you need to enable JavaScript. One or the other.

[10] Bells and Whistles: Rotation Invariance

As discussed in one of the previous sections, our feature descriptors are also invariant under rotation. This is nice, because even if we fail at photography and rotate the camera around a non-yaw axis, our automatically detected features can still be standardized and matched.

You probably don't believe me, so I've prepared as proof a couple of demonstrations –

Kavi im1 feature detection (no rotation) Kavi im2 feature detection (no rotation)

Kavi im1 feature detection (rotation) Kavi im2 feature detection (rotation)

Seems like it still works! It's finding matching features, at least. As seen below, we can also run the full stitching algorithm on an arbitrarily rotated set of images. Here's hoping you aren't tired of Indian Rock scenery yet...

Indian Rock image 1 features Indian Rock image 2 features

Rotationally invariant panorama stitching

The sky-slicing vertical seam is likely an artifact caused by paint-bucketing transparent regions in the source PNG image.

[11] Bells and Whistles: Panorama Recognition

Here we implement an algorithm that, given an unordered list of images (some of which might match up), is able to automatically discover pairwise panoramas and stitch them together. It does this by extracting feature descriptors for every image and then exhaustively searching for every image's nearest neighbor via the SSD matching technique we've come to know and love. (The image with the most feature matches is our best guess for each image's neighbor.)

Since not every image is guaranteed to have a panorama partner, we consider a subject to be isolated when, again emulating Lowe, the ratio between the match count with the 1-NN and the match count with the 2-NN is less than some threshold... say 4. If the 2-NN count is 0, we're willing to pair the image with its 1-NN as long as the 1-NN count is greater than 4. We also are forced to assume isolation if an image has less than four feature matches with its 1-NN (or RANSAC later leaves it with less than three inliers overall). Four is the absolute minimum number of feature pairings required, since that's how many correspondences we need in order to define a homography.

To test our discovery algorithm, we feed it the following five images in the following scrambled order and let it do its thing. We expect an output of two panoramas... and to our excitement, the algorithm delivers.

Output panorama #1 (raw)

Output panorama #2 (raw)

[12] Looking Back

First of all, I learned that planar mappings don't really work for > 180° mosaics, as I spent an embarrassing amount of time trying to warp and merge two images that spanned an excessive view space.

However, I'm glad I got to experience the power of homographies / projective transformations. It was very exciting to be able to modify the perspective of my images in a specific way, and to be able to stitch multiple photographs into one! I never realized that it was possible to achieve these effects with just a 3 x 3 matrix and some blending.

I leave you with a rectified image of my lunch. (Because I definitely ate all of these pizzas.) Thanks for reading! :)

Lunchtime carnage Rectified carnage

Part B edit: I was incredibly impressed by how well features could automatically be detected and matched; feats like this are what'll finally convince people to entrust their lives to self-driving cars.

As a summary, I learned from this project that SSD is godlike and that I have a new post-processing option whenever I want to capture a widely sweeping scene over multiple photographs. It's great how easy it is to use after Part B; all I have to do is feed the program a list of filenames and there it is! A panorama...!