====================================================================
Unsupervised Learning of Depth and Ego-Motion from Video
Tinghui Zhou, Matthew Brown, Noah Snavely, David Lowe (Google)
====================================================================

- SUMMARY
Zhou et al. provide a way to train CNNs in an unsupervised fashion (i.e. with
only monocular video sequences) to predict depth from a single view and camera
pose from multiple views. As a loss signal, they generate reconstructions of a
target frame based on a warp of each temporally adjacent source frame and the
predicted depth/pose, and then measure the L1 distance between each of the
reconstructed images and the original target frame (summing over all sources).
A few extra notes:
(1) They also predict "explainability" for each pixel and source, which
    essentially says how confident the network is in its ability to synthesize
    the view correctly at that target pixel for that source. The explainability
    value is used to weight each pixel's contribution to the loss (see the
    sketch at the end of these notes).
(2) They compute the L1 loss at multiple scales and include a depth smoothness
    term in the loss in order to allow gradients to come from large areas
    instead of just local (neighboring) regions in the source.

- STRENGTHS
1. The method requires no labels whatsoever; all that is required is a
   collection of videos from the relevant domain (and the camera intrinsics
   for those videos).
2. The networks are fully convolutional.

- WEAKNESSES
1. The network is essentially allowed to ignore (i.e. not optimize for)
   difficult structures via the explainability prediction. Therefore, with
   this setup there will be image components that the network can't really
   deal with (as examples, the paper mentions thin objects and moving
   elements).
2. The camera intrinsics must be known.
3. The method doesn't perform quite as well at depth estimation as some of
   the methods that do have supervision data.

- CRITIQUE OF EXPERIMENTS
The paper provides qualitative and quantitative evaluations for depth
estimation with comparisons to supervised methods. It also includes
performance on images from a different dataset (Make3D), showing solid
cross-domain generalization relative to other models. To evaluate pose
estimation performance, Zhou et al. consider Absolute Trajectory Error for
5-frame sequences (also with comparisons to other methods). Finally, they
visualize predicted explainability maps as part of a qualitative discussion.
I think they do a good job of highlighting the strengths and limitations of
their system in their results analysis (e.g. the explainability
visualizations flagging dynamic elements in some images as "unexplainable"),
and as far as I'm aware they compare against a fair set of alternative
methods.

- SUGGESTIONS FOR IMPROVEMENT
I'm not sure how much it would help, but they could apply random
transformations to each video sequence to get a greater variety of training
data.

- POSSIBLE EXTENSIONS/FOLLOWUPS
Instead of always assuming that the intrinsics are known, they could add some
kind of calibration routine (learned or not) for instances in which the
intrinsics are not available. They mention that they plan to deal with such
cases in future work.
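
As a concrete illustration of the explainability-weighted photometric loss
described in the summary, here is a minimal PyTorch-style sketch. This is not
the authors' code; the tensor names and shapes are my assumptions.

    import torch

    def view_synthesis_loss(target, recons, expl_masks):
        """target: (B, 3, H, W) target frame
        recons: list of (B, 3, H, W) reconstructions, one per source frame
        expl_masks: list of (B, 1, H, W) predicted explainability maps
        """
        loss = 0.0
        for recon, expl in zip(recons, expl_masks):
            # per-pixel L1 photometric error, averaged over color channels
            per_pixel = torch.abs(target - recon).mean(dim=1, keepdim=True)
            # weight each pixel by how "explainable" the network believes it
            # is, then sum the contributions from all source frames
            loss = loss + (expl * per_pixel).mean()
        return loss

In the paper, this term is paired with a regularization loss that pushes the
explainability values toward 1 (otherwise the network could trivially zero
them all out), plus the multi-scale and depth smoothness terms mentioned in
note (2) above.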
- ohjay
* View reconstruction (image 1 -> image 2):
  (1) multiply 2D (x, y, 1) image point by inverse intrinsic matrix
      -> takes it to "normalized coordinates", a ray in the correct direction
         (ambiguity remains since we don't know the depth along the ray)
  (2) multiply by depth along ray -> takes it to correct 3D camera coordinates
  (3) multiply by relative pose -> takes it to other camera's coordinates
  (4) multiply by other camera's intrinsic matrix
      -> takes it to 2D coordinates in second camera's image
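
A minimal sketch of those four steps for a single pixel, assuming a shared
intrinsic matrix K for both views (as in the paper); the function and
variable names here are hypothetical:

    import numpy as np

    def project_to_second_view(px, py, depth, K, T_1to2):
        """Map a pixel in image 1 to its location in image 2.

        px, py: pixel coordinates in image 1
        depth: predicted depth at that pixel
        K: 3x3 camera intrinsic matrix (shared by both views)
        T_1to2: 4x4 relative pose taking camera-1 coordinates to camera-2
        """
        # (1) homogeneous pixel -> normalized coordinates: a ray direction
        #     (depth along the ray is still unknown at this point)
        ray = np.linalg.inv(K) @ np.array([px, py, 1.0])
        # (2) scale by the predicted depth -> 3D point in camera 1's frame
        X_cam1 = depth * ray
        # (3) apply the relative pose -> same point in camera 2's frame
        X_cam2 = (T_1to2 @ np.append(X_cam1, 1.0))[:3]
        # (4) project with the intrinsics, then dehomogenize
        uvw = K @ X_cam2
        return uvw[:2] / uvw[2]  # (x, y) in image 2; generally non-integer,
                                 # hence the paper's bilinear sampling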