====================================================================
Unsupervised Learning of Depth and Ego-Motion from Video
Tinghui Zhou, Matthew Brown, Noah Snavely, David Lowe (Google)
====================================================================

- SUMMARY
Zhou et al. provide a way to train CNNs in an unsupervised fashion (i.e. with
only monocular video sequences) to predict depth from a single view and camera
pose from multiple views. As a loss signal, they generate reconstructions of a
target frame based on a warp of each temporally adjacent source frame and the
predicted depth/pose, and then measure the L1 distance between each of the
reconstructed images and the original target frame (summing over all sources).
A few extra notes:
(1) They also predict "explainability" for each pixel and source, which
    essentially says how confident the network is in its ability to synthesize
    the view correctly at that target pixel for that source. The explainability
    value is used to weight each pixel's contribution to the loss (see the
    sketch at the end of these notes).
(2) They compute the L1 loss at multiple scales and include a depth smoothness
    term in the loss in order to allow gradients to come from large areas
    instead of just local (neighboring) regions in the source.

- STRENGTHS
1. The method requires no labels whatsoever; all that is required is a
   collection of videos from the relevant domain (and the camera intrinsics
   for those videos).
2. The networks are fully convolutional.

- WEAKNESSES
1. The network is essentially allowed to ignore (i.e. not optimize for)
   difficult structures via the explainability prediction. Therefore, with
   this setup there will be image components that the network can't really
   deal with (as examples, the paper mentions thin objects and moving
   elements).
2. The camera intrinsics must be known.
3. The method doesn't perform quite as well at depth estimation as some of
   the methods that do have supervision data.

- CRITIQUE OF EXPERIMENTS
The paper provides qualitative and quantitative evaluations for depth
estimation with comparisons to supervised methods. It also includes
performance on images from a different dataset (Make3D), showing solid
cross-domain generalization relative to other models. To evaluate pose
estimation performance, Zhou et al. consider Absolute Trajectory Error for
5-frame sequences (also with comparisons to other methods). Finally, they
visualize predicted explainability maps as part of a qualitative discussion.
I think they do a good job of highlighting the strengths and limitations of
their system in their results analysis (e.g. the explainability
visualizations flagging dynamic elements in some images as "unexplainable"),
and as far as I'm aware they compare against a fair set of alternative
methods.

- SUGGESTIONS FOR IMPROVEMENT
I'm not sure how much it would help, but they could apply random
transformations to each video sequence to get a greater variety of training
data.

- POSSIBLE EXTENSIONS/FOLLOWUPS
Instead of always assuming that the intrinsics are known, they could add some
kind of calibration routine (learned or not) for instances in which the
intrinsics are not available. They mention that they plan to deal with such
cases in future work.
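
As a concrete illustration of the explainability-weighted photometric loss
described in the summary, here is a minimal PyTorch-style sketch. This is not
the authors' code; the tensor names and shapes are my assumptions.

    import torch

    def view_synthesis_loss(target, recons, expl_masks):
        """target: (B, 3, H, W) target frame
        recons: list of (B, 3, H, W) reconstructions, one per source frame
        expl_masks: list of (B, 1, H, W) predicted explainability maps
        """
        loss = 0.0
        for recon, expl in zip(recons, expl_masks):
            # per-pixel L1 photometric error, averaged over color channels
            per_pixel = torch.abs(target - recon).mean(dim=1, keepdim=True)
            # weight each pixel by how "explainable" the network believes it
            # is, then sum the contributions from all source frames
            loss = loss + (expl * per_pixel).mean()
        return loss

In the paper, this term is paired with a regularization loss that pushes the
explainability values toward 1 (otherwise the network could trivially zero
them all out), plus the multi-scale and depth smoothness terms mentioned in
note (2) above.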
- ohjay
* View reconstruction (image 1 -> image 2):
  (1) multiply 2D (x, y, 1) image point by inverse intrinsic matrix
      -> takes it to "normalized coordinates", a ray in the correct direction
         (ambiguity remains since we don't know the depth along the ray)
  (2) multiply by depth along ray -> takes it to correct 3D camera coordinates
  (3) multiply by relative pose -> takes it to other camera's coordinates
  (4) multiply by other camera's intrinsic matrix
      -> takes it to 2D coordinates in second camera's image
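
A minimal sketch of those four steps for a single pixel, assuming a shared
intrinsic matrix K for both views (as in the paper); the function and
variable names here are hypothetical:

    import numpy as np

    def project_to_second_view(px, py, depth, K, T_1to2):
        """Map a pixel in image 1 to its location in image 2.

        px, py: pixel coordinates in image 1
        depth: predicted depth at that pixel
        K: 3x3 camera intrinsic matrix (shared by both views)
        T_1to2: 4x4 relative pose taking camera-1 coordinates to camera-2
        """
        # (1) homogeneous pixel -> normalized coordinates: a ray direction
        #     (depth along the ray is still unknown at this point)
        ray = np.linalg.inv(K) @ np.array([px, py, 1.0])
        # (2) scale by the predicted depth -> 3D point in camera 1's frame
        X_cam1 = depth * ray
        # (3) apply the relative pose -> same point in camera 2's frame
        X_cam2 = (T_1to2 @ np.append(X_cam1, 1.0))[:3]
        # (4) project with the intrinsics, then dehomogenize
        uvw = K @ X_cam2
        return uvw[:2] / uvw[2]  # (x, y) in image 2; generally non-integer,
                                 # hence the paper's bilinear sampling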