Learning from Unlabelled Videos Using Contrastive Predictive Neural 3D Mapping

Authors: Adam W. Harley, Shrinidhi K. Lakshmikanth, Fangyu Li, Xian Zhou, Hsiao-Yu Fish Tung, Katerina Fragkiadaki

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that the proposed model learns visual representations useful for (1) semi-supervised learning of 3D object detectors, and (2) unsupervised learning of 3D moving object detectors, by estimating the motion of the inferred 3D feature maps in videos of dynamic scenes. To the best of our knowledge, this is the first work that empirically shows view prediction to be a scalable self-supervised task beneficial to 3D object detection."
Researcher Affiliation | Academia | Adam W. Harley (Carnegie Mellon University, aharley@cmu.edu); Shrinidhi K. Lakshmikanth (Carnegie Mellon University, kowshika@cmu.edu); Fangyu Li (Carnegie Mellon University, fangyul@cmu.edu); Xian Zhou (Carnegie Mellon University, zhouxian@cmu.edu); Hsiao-Yu Fish Tung (Carnegie Mellon University, htung@cs.cmu.edu); Katerina Fragkiadaki (Carnegie Mellon University, katef@cs.cmu.edu)
Pseudocode | No | The paper describes its architecture and methods using text, diagrams, and mathematical equations (e.g., for loss functions). However, it does not include any explicitly labeled pseudocode blocks or algorithms with structured steps.
Open Source Code | Yes | "Our code and data are publicly available." Footnote link: https://github.com/aharley/neural_3d_mapping
Open Datasets | Yes | "We train our models in CARLA (Dosovitskiy et al., 2017), an open-source photorealistic simulator of urban driving scenes... For additional testing with real-world data, we use the (single-view) object detection benchmark from the KITTI dataset (Geiger et al., 2013)."
Dataset Splits | Yes | "We treat the Town1 data as the training set, and the Town2 data as the test set, so there is no overlap between the train and test images... For additional testing with real-world data, we use the (single-view) object detection benchmark from the KITTI dataset (Geiger et al., 2013), with the official train/val split: 3712 training frames, and 3769 validation frames."
Hardware Specification | Yes | "On 12G Titan X GPUs we encode a space sized 32 m × 32 m × 8 m at a resolution of 128 × 128 × 32; with a batch size of 4, iteration time is 0.2 s/iter." (A voxel-size arithmetic check follows the table.)
Software Dependencies | No | "Our model is implemented in Python/Tensorflow, with custom CUDA kernels for the 3D cross correlation..." The paper mentions Python, TensorFlow, and CUDA but does not provide specific version numbers for any of these software components. (An illustrative sketch of 3D cross-correlation follows the table.)
Experiment Setup | Yes | "Inputs: Our input images are sized 128 × 384 pixels... We trim the input pointclouds to a maximum of 100,000 points... The 3D feature encoder-decoder has the following architecture... We use F = 32... For predicting RGB, E = 3; for metric learning, we use E = 32... We use the distance-weighted sampling strategy proposed by Wu et al. (2017)... We use a coefficient of 0.1 for L_2D-contrast, 1.0 for L_3D-contrast, and 0.001 for the L2 losses... Training to convergence (approx. 200k iterations) takes 48 hours on a single GPU. We use a learning rate of 0.001 for all modules except the 3D flow module, for which we use 0.0001. We use the Adam optimizer, with β1 = 0.9, β2 = 0.999." (A hedged training-configuration sketch follows the table.)
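
The hardware row's extent and resolution figures imply a uniform voxel size; the minimal sketch below (variable names are illustrative, not taken from the released code) checks that arithmetic.

```python
# Voxel-size check for the quoted figures: a 32 m x 32 m x 8 m space
# encoded at 128 x 128 x 32 voxels gives 0.25 m per voxel side.
extent_m = (32.0, 32.0, 8.0)   # metric extent of the mapped space (x, y, z)
resolution = (128, 128, 32)    # voxel grid resolution (x, y, z)
voxel_size_m = [e / r for e, r in zip(extent_m, resolution)]
print(voxel_size_m)  # [0.25, 0.25, 0.25]
```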
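
The software-dependencies row quotes custom CUDA kernels for the 3D cross correlation; those kernels are not reproduced here. The sketch below is a brute-force NumPy illustration of what such an operation computes, under the assumption that it builds a cost volume of dot-product similarities between one feature volume and displaced copies of another. The function name and the max_disp parameter are assumptions for illustration, not the authors' API.

```python
import numpy as np

def cross_correlation_3d(feat1, feat2, max_disp=2):
    """Brute-force 3D cross-correlation (cost volume) between two feature volumes.

    feat1, feat2: float arrays of shape (D, H, W, C).
    Returns an array of shape (D, H, W, (2*max_disp + 1)**3), where each channel
    holds the dot-product similarity between feat1 at a voxel and feat2 at one
    displacement within the (2*max_disp + 1)^3 search window.
    """
    D, H, W, C = feat1.shape
    k = 2 * max_disp + 1
    # Zero-pad feat2 so every shifted slice stays in bounds.
    feat2_pad = np.pad(feat2, ((max_disp, max_disp),) * 3 + ((0, 0),))
    cost = np.empty((D, H, W, k ** 3), dtype=feat1.dtype)
    idx = 0
    for dz in range(k):
        for dy in range(k):
            for dx in range(k):
                shifted = feat2_pad[dz:dz + D, dy:dy + H, dx:dx + W, :]
                cost[..., idx] = np.sum(feat1 * shifted, axis=-1)
                idx += 1
    return cost
```

A CUDA kernel, as mentioned in the paper, would parallelize this same computation over voxels and displacements rather than looping in Python.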
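
The experiment-setup row lists loss coefficients and optimizer hyperparameters. The sketch below shows one way those numbers could be wired together in TensorFlow; the loss tensor names and the per-module optimizer split are placeholders, not identifiers from the released code.

```python
import tensorflow as tf

# Loss coefficients quoted in the experiment-setup row.
W_2D_CONTRAST = 0.1    # 2D contrastive loss
W_3D_CONTRAST = 1.0    # 3D contrastive loss
W_L2 = 0.001           # L2 (e.g. RGB view-prediction) losses

def total_loss(loss_2d_contrast, loss_3d_contrast, loss_l2):
    # Weighted sum of the individual losses (placeholder tensor names).
    return (W_2D_CONTRAST * loss_2d_contrast
            + W_3D_CONTRAST * loss_3d_contrast
            + W_L2 * loss_l2)

# Adam with beta1 = 0.9, beta2 = 0.999; learning rate 1e-3 for all modules
# except the 3D flow module, which uses 1e-4.
main_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999)
flow_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)
```

In practice each optimizer would be applied only to the variables of its corresponding module, matching the paper's statement that the 3D flow module uses the smaller learning rate.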