Let \(\mathbf{X} = (X, Y, Z)\) be a 3D point in camera coordinates. Let \(f\) be the focal length (in other words, the distance from the center of projection to the image plane). By similar triangles, the projection of \(\mathbf{X}\) onto the image plane is at \((x, y, f)\), where
\[\begin{aligned} x &= fX / Z \\ y &= fY / Z \end{aligned}\]up to a possible sign flip, depending on which way the coordinate axes point. In 3D terms, this describes the ray from the center of projection to the image point. Usually we ignore the \(f\) in \((x, y, f)\) because we only care about the 2D image coordinates. The typical \(3 \times 3\) intrinsic matrix \(\mathbf{K}\) just projects the inhomogeneous \(\mathbf{X}\) to the homogeneous \((x, y, 1)\) and never bothers to account for the 3D \(z\)-coordinate of the image plane.
\[\mathbf{K} = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}\](Sometimes there are also translation and distortion terms, but this is the simplest form of the matrix which produces the \(x = fX / Z\) and \(y = fY / Z\) relationships from before.)
\[\begin{aligned} \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} &= \begin{bmatrix} fX \\ fY \\ Z \end{bmatrix} \\ &\sim \begin{bmatrix} fX / Z \\ fY / Z \\ 1 \end{bmatrix} \end{aligned}\](The second step divides through by \(Z\), which is allowed because homogeneous coordinates are only defined up to scale.) One interesting thing is that we can apply \(\mathbf{K}^{-1}\) in order to back-project our 2D point \((x, y, 1)\) to the original ray \((x, y, f)\) passing through the pinhole!
\[\begin{aligned} \mathbf{K}^{-1} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} &= \begin{bmatrix} \frac{1}{f} & 0 & 0 \\ 0 & \frac{1}{f} & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \\ &= \begin{bmatrix} x / f \\ y / f \\ 1 \end{bmatrix} \\ &\sim \begin{bmatrix} x \\ y \\ f \end{bmatrix} \end{aligned}\]Of course, in this case it’s a homogeneous entity and we don’t know the actual scale of the 3D point in camera coordinates (i.e. we don’t know how far along the ray the actual 3D point is). All we have is this vector representing a ray. If we like, we can rescale it ourselves and make it a unit vector.
By the way, sometimes we call this form (\(\mathbf{K}^{-1}\) times the image point) the normalized coordinates of the point. Note that this stuff works even if our \(\mathbf{K}\) is a little more complicated! [19bp]
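As a concrete (if toy) numpy sketch of both directions, with an assumed focal length and point: project with \(\mathbf{K}\), then back-project with \(\mathbf{K}^{-1}\) and rescale to a unit ray.

```python
import numpy as np

f = 2.0                              # assumed focal length
K = np.array([[f, 0.0, 0.0],
              [0.0, f, 0.0],
              [0.0, 0.0, 1.0]])

X = np.array([1.0, 2.0, 4.0])        # 3D point in camera coordinates

# Project: K @ X gives (fX, fY, Z); dividing by Z gives (x, y, 1).
p = K @ X
p /= p[2]                            # [0.5, 1.0, 1.0]

# Back-project: K^{-1} @ (x, y, 1) gives the normalized coordinates
# (x/f, y/f, 1), i.e. the ray through the pinhole (up to scale).
ray = np.linalg.inv(K) @ p
ray /= np.linalg.norm(ray)           # optionally rescale to a unit vector
```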
In the camera resectioning problem, we estimate the \(3 \times 4\) camera projection matrix \(\mathbf{P}\) from a set of 2D/3D point correspondences. Note that \(\mathbf{P}\) homogeneously projects 3D world points to 2D image points; it is the intrinsic and extrinsic matrices combined, \(\mathbf{P} = \mathbf{K} [\mathbf{R} \mid \mathbf{t}]\).
We can obtain a linear estimate for \(\mathbf{P}\) via the direct linear transformation (DLT) method. Then we can use this as an initial solution to an iterative nonlinear estimation method such as Levenberg-Marquardt to get a more refined solution. [19bq]
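Here is a minimal sketch of the DLT step (with a hypothetical helper name), using the standard construction where each correspondence contributes two rows to a homogeneous system \(\mathbf{A}\,\mathrm{vec}(\mathbf{P}) = \mathbf{0}\), solved via SVD. Point normalization, which the full method needs for numerical stability, is omitted here.

```python
import numpy as np

def dlt_resection(X_world, x_img):
    """Estimate the 3x4 projection matrix P from >= 6 correspondences.

    X_world: (n, 3) array of 3D points; x_img: (n, 2) array of 2D points.
    """
    rows = []
    for (X, Y, Z), (x, y) in zip(X_world, x_img):
        Xh = [X, Y, Z, 1.0]
        # Each correspondence gives two equations in the 12 entries of P.
        rows.append([0, 0, 0, 0] + [-v for v in Xh] + [y * v for v in Xh])
        rows.append(Xh + [0, 0, 0, 0] + [-x * v for v in Xh])
    A = np.asarray(rows)
    # The solution is the right singular vector with the smallest singular
    # value (the null space of A in the noise-free case), up to scale.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)
```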
In the perspective-n-point (PnP) problem, we estimate the camera pose (\(\mathbf{R}, \mathbf{t}\)) from a collection of 2D/3D point correspondences. The assumption is that we know the camera calibration matrix. A general strategy for solving this problem is to estimate the (exact) 3D camera coordinates that generated the 2D points, and then solve for the rigid transformation (\(\mathbf{R}, \mathbf{t}\)) between the 3D world coordinates and the 3D camera coordinates using e.g. Umeyama’s method [which estimates the rotation and translation (and scale, if we want) between two sets of n-D points]. [19bq]
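A sketch of the Umeyama-style rigid alignment (scale omitted), using the standard SVD construction; `rigid_align` is a hypothetical helper name.

```python
import numpy as np

def rigid_align(src, dst):
    """Find R, t minimizing ||(R @ src.T).T + t - dst|| (no scale).

    src, dst: (n, 3) arrays of corresponding 3D points.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - mu_src).T @ (dst - mu_dst)
    U, _, Vt = np.linalg.svd(H)
    # The sign fix guards against reflections (det = -1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_dst - R @ mu_src
    return R, t
```

For PnP, `src` would be the 3D points in world coordinates and `dst` the recovered 3D points in camera coordinates.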
One way to obtain a linear estimate for the camera pose is to use the EPnP (efficient PnP) algorithm proposed by Lepetit et al. It takes \(n\) 2D/3D point correspondences and estimates the camera pose, and is called “efficient” because its runtime is linear in the number of correspondences [\(O(n)\)].
The EPnP algorithm parameterizes the 3D points as weighted combinations of four control points. We define these control points and then compute the correct weights for each 3D point in world coordinates. Next we solve for the control points in the camera frame and use them, along with the same weights from before, to obtain each 3D point in camera coordinates. Finally we can use Umeyama’s method to solve for the \(\mathbf{R}\) and \(\mathbf{t}\) between the world and camera representations. [19bq]
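As a sketch of just the parameterization step (not the full EPnP solver), here is how the weights of a 3D point with respect to four control points can be computed. The control points below are assumed for illustration; EPnP typically chooses them from the centroid and principal directions of the data.

```python
import numpy as np

# Four assumed, non-coplanar control points, one per row.
C = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

X = np.array([0.2, 0.3, 0.4])     # a 3D world point to re-express

# Solve for weights alpha with X = sum_j alpha_j C_j and sum_j alpha_j = 1,
# i.e. barycentric coordinates of X with respect to the control points.
A = np.vstack([C.T, np.ones(4)])  # 4x4 system
b = np.append(X, 1.0)
alpha = np.linalg.solve(A, b)     # [0.1, 0.2, 0.3, 0.4]

# The same weights applied to the control points in the *camera* frame
# would reproduce the point in camera coordinates.
assert np.allclose(alpha @ C, X)
```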
To go from a point in 2D image coordinates to the point in 3D camera coordinates, we can make use of Finsterwalder’s solution to the P3P problem. It assumes that we have (a) three 2D image points, (b) the distances between each pair of points in 3D space, and (c) the camera calibration matrix, and gives us up to four solutions for the same points in 3D camera coordinates. [19bq]
Using the inverse camera calibration matrix, Finsterwalder converts each image point into a unit vector representing the ray going from the image point through the pinhole. The problem setup considers the tetrahedron formed by the three 3D points and the pinhole, and solves for the scale of each ray which takes it to its corresponding 3D point. In this way we obtain the 3D points in camera coordinates. Note that this uses exactly three points, as opposed to an arbitrary \(n\) like EPnP; on the other hand, three correspondences is all it needs, whereas EPnP needs at least four.
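Here is a sketch of the geometric setup only (not Finsterwalder’s actual solution, which reduces the constraints to a cubic): form the unit rays and write down the law-of-cosines constraints that the unknown ray scales must satisfy. All values below are assumed for illustration.

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])    # assumed calibration matrix

# Three image points (homogeneous) and the known pairwise 3D distances.
pts = np.array([[300.0, 200.0, 1.0],
                [400.0, 250.0, 1.0],
                [350.0, 300.0, 1.0]])
d12, d13, d23 = 1.0, 1.2, 0.9      # assumed side lengths of the 3D triangle

# Unit rays from the pinhole through each image point.
rays = (np.linalg.inv(K) @ pts.T).T
rays /= np.linalg.norm(rays, axis=1, keepdims=True)

# Cosines of the angles between the rays at the pinhole. The unknown
# scales s1, s2, s3 must satisfy the law of cosines on each face of the
# tetrahedron, e.g. d12**2 == s1**2 + s2**2 - 2*s1*s2*cos12; solving this
# system yields up to four (s1, s2, s3) triples.
cos12, cos13, cos23 = rays[0] @ rays[1], rays[0] @ rays[2], rays[1] @ rays[2]

# Given a solved triple, the 3D camera-frame points are s_i * rays[i].
```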