Learning from Unlabelled Videos Using Contrastive Predictive Neural 3D Mapping

Authors: Adam W. Harley, Shrinidhi K. Lakshmikanth, Fangyu Li, Xian Zhou, Hsiao-Yu Fish Tung, Katerina Fragkiadaki

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that the proposed model learns visual representations useful for (1) semi-supervised learning of 3D object detectors, and (2) unsupervised learning of 3D moving object detectors, by estimating the motion of the inferred 3D feature maps in videos of dynamic scenes. To the best of our knowledge, this is the first work that empirically shows view prediction to be a scalable self-supervised task beneficial to 3D object detection."
Researcher Affiliation | Academia | Adam W. Harley (Carnegie Mellon University, aharley@cmu.edu); Shrinidhi K. Lakshmikanth (Carnegie Mellon University, kowshika@cmu.edu); Fangyu Li (Carnegie Mellon University, fangyul@cmu.edu); Xian Zhou (Carnegie Mellon University, zhouxian@cmu.edu); Hsiao-Yu Fish Tung (Carnegie Mellon University, htung@cs.cmu.edu); Katerina Fragkiadaki (Carnegie Mellon University, katef@cs.cmu.edu)
Pseudocode | No | The paper describes its architecture and methods using text, diagrams, and mathematical equations (e.g., for loss functions). However, it does not include any explicitly labeled pseudocode blocks or algorithms with structured steps.
Open Source Code | Yes | "Our code and data are publicly available." Footnote link: https://github.com/aharley/neural_3d_mapping
Open Datasets | Yes | "We train our models in CARLA (Dosovitskiy et al., 2017), an open-source photorealistic simulator of urban driving scenes... For additional testing with real-world data, we use the (single-view) object detection benchmark from the KITTI dataset (Geiger et al., 2013)."
Dataset Splits | Yes | "We treat the Town1 data as the training set, and the Town2 data as the test set, so there is no overlap between the train and test images... For additional testing with real-world data, we use the (single-view) object detection benchmark from the KITTI dataset (Geiger et al., 2013), with the official train/val split: 3712 training frames, and 3769 validation frames."
Hardware Specification | Yes | "On 12G Titan X GPUs we encode a space sized 32 m × 32 m × 8 m at a resolution of 128 × 128 × 32; with a batch size of 4, iteration time is 0.2 s/iter." (A voxel-size arithmetic check follows the table.)
Software Dependencies | No | "Our model is implemented in Python/Tensorflow, with custom CUDA kernels for the 3D cross correlation..." The paper mentions Python, TensorFlow, and CUDA but does not provide specific version numbers for any of these software components. (An illustrative sketch of 3D cross-correlation follows the table.)
Experiment Setup | Yes | "Inputs: Our input images are sized 128 × 384 pixels... We trim the input pointclouds to a maximum of 100,000 points... The 3D feature encoder-decoder has the following architecture... We use F = 32... For predicting RGB, E = 3; for metric learning, we use E = 32... We use the distance-weighted sampling strategy proposed by Wu et al. (2017)... We use a coefficient of 0.1 for L_2D-contrast, 1.0 for L_3D-contrast, and 0.001 for the L2 losses... Training to convergence (approx. 200k iterations) takes 48 hours on a single GPU. We use a learning rate of 0.001 for all modules except the 3D flow module, for which we use 0.0001. We use the Adam optimizer, with β1 = 0.9, β2 = 0.999." (A hedged training-configuration sketch follows the table.)
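
The hardware row's extent and resolution figures imply a uniform voxel size; the minimal sketch below (variable names are illustrative, not taken from the released code) checks that arithmetic.

```python
# Voxel-size check for the quoted figures: a 32 m x 32 m x 8 m space
# encoded at 128 x 128 x 32 voxels gives 0.25 m per voxel side.
extent_m = (32.0, 32.0, 8.0)   # metric extent of the mapped space (x, y, z)
resolution = (128, 128, 32)    # voxel grid resolution (x, y, z)
voxel_size_m = [e / r for e, r in zip(extent_m, resolution)]
print(voxel_size_m)  # [0.25, 0.25, 0.25]
```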
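
The software-dependencies row quotes custom CUDA kernels for the 3D cross correlation; those kernels are not reproduced here. The sketch below is a brute-force NumPy illustration of what such an operation computes, under the assumption that it builds a cost volume of dot-product similarities between one feature volume and displaced copies of another. The function name and the max_disp parameter are assumptions for illustration, not the authors' API.

```python
import numpy as np

def cross_correlation_3d(feat1, feat2, max_disp=2):
    """Brute-force 3D cross-correlation (cost volume) between two feature volumes.

    feat1, feat2: float arrays of shape (D, H, W, C).
    Returns an array of shape (D, H, W, (2*max_disp + 1)**3), where each channel
    holds the dot-product similarity between feat1 at a voxel and feat2 at one
    displacement within the (2*max_disp + 1)^3 search window.
    """
    D, H, W, C = feat1.shape
    k = 2 * max_disp + 1
    # Zero-pad feat2 so every shifted slice stays in bounds.
    feat2_pad = np.pad(feat2, ((max_disp, max_disp),) * 3 + ((0, 0),))
    cost = np.empty((D, H, W, k ** 3), dtype=feat1.dtype)
    idx = 0
    for dz in range(k):
        for dy in range(k):
            for dx in range(k):
                shifted = feat2_pad[dz:dz + D, dy:dy + H, dx:dx + W, :]
                cost[..., idx] = np.sum(feat1 * shifted, axis=-1)
                idx += 1
    return cost
```

A CUDA kernel, as mentioned in the paper, would parallelize this same computation over voxels and displacements rather than looping in Python.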
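
The experiment-setup row lists loss coefficients and optimizer hyperparameters. The sketch below shows one way those numbers could be wired together in TensorFlow; the loss tensor names and the per-module optimizer split are placeholders, not identifiers from the released code.

```python
import tensorflow as tf

# Loss coefficients quoted in the experiment-setup row.
W_2D_CONTRAST = 0.1    # 2D contrastive loss
W_3D_CONTRAST = 1.0    # 3D contrastive loss
W_L2 = 0.001           # L2 (e.g. RGB view-prediction) losses

def total_loss(loss_2d_contrast, loss_3d_contrast, loss_l2):
    # Weighted sum of the individual losses (placeholder tensor names).
    return (W_2D_CONTRAST * loss_2d_contrast
            + W_3D_CONTRAST * loss_3d_contrast
            + W_L2 * loss_l2)

# Adam with beta1 = 0.9, beta2 = 0.999; learning rate 1e-3 for all modules
# except the 3D flow module, which uses 1e-4.
main_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999)
flow_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)
```

In practice each optimizer would be applied only to the variables of its corresponding module, matching the paper's statement that the 3D flow module uses the smaller learning rate.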