Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video

Authors: Jiawang Bian, Zhichao Li, Naiyan Wang, Huangying Zhan, Chunhua Shen, Ming-Ming Cheng, Ian Reid

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Comprehensive evaluation results demonstrate that our depth estimator achieves the state-of-the-art performance on the KITTI dataset. Moreover, we show that our ego-motion network is able to predict a globally scale-consistent camera trajectory for long video sequences, and the resulting visual odometry accuracy is competitive with the recent model that is trained using stereo videos. We conduct detailed ablation studies that clearly demonstrate the efficacy of the proposed approach."
Researcher Affiliation | Collaboration | Jia-Wang Bian (1,2), Zhichao Li (3), Naiyan Wang (3), Huangying Zhan (1,2), Chunhua Shen (1,2), Ming-Ming Cheng (4), Ian Reid (1,2). Affiliations: 1 University of Adelaide, Australia; 2 Australian Centre for Robotic Vision, Australia; 3 TuSimple, China; 4 Nankai University, China.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement of, or a link to, open-source code for the described methodology.
Open Datasets | Yes | "For the depth network, we train and test models on the KITTI raw dataset [15] using Eigen's [5] split, which is the same as in related works [10, 9, 11, 7]. Following [7], we use a snippet of three sequential video frames as a training sample... Also, we pre-train the network on Cityscapes [30] and finetune on KITTI [15], each for 200 epochs. For the pose network, following Zhan et al. [17], we evaluate visual odometry results on the KITTI odometry dataset [15], where sequences 00-08/09-10 are used for training/testing."
Dataset Splits | Yes | "For the depth network, we train and test models on the KITTI raw dataset [15] using Eigen's [5] split, which is the same as in related works [10, 9, 11, 7]. Following [7], we use a snippet of three sequential video frames as a training sample... Also, we pre-train the network on Cityscapes [30] and finetune on KITTI [15], each for 200 epochs. For the pose network, following Zhan et al. [17], we evaluate visual odometry results on the KITTI odometry dataset [15], where sequences 00-08/09-10 are used for training/testing." (A minimal sketch of this split follows the table.)
Hardware Specification | Yes | "We compare with CC [11], and both methods are trained on a single 16GB Tesla V100 GPU."
Software Dependencies | No | "The proposed learning framework is implemented using the PyTorch library [28]." This mentions PyTorch but does not specify a version number, and no other software dependencies are listed with versions.
Experiment Setup | Yes | "We use the ADAM [29] optimizer, and set the batch size to 4 and the learning rate to 10^-4. During training, we adopt α = 1.0, β = 0.1, and γ = 0.5 in Eqn. 1. We train the network for 200 epochs with 1000 randomly sampled batches in one epoch, and validate the model at each epoch. Also, we pre-train the network on Cityscapes [30] and finetune on KITTI [15], each for 200 epochs. Here we follow Eigen et al.'s [5] evaluation metrics for depth evaluation." (A hedged training-configuration sketch follows the table.)
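
The KITTI odometry split quoted in the Dataset Splits row is simple enough to capture in a few lines. Below is a minimal, hypothetical Python sketch of that train/test partition; the sequence IDs come from the quoted text, while the helper name `split_for` is illustrative and not from the paper.

```python
# Hypothetical helper reflecting the quoted KITTI odometry split
# (train on sequences 00-08, test on 09-10); names are illustrative.

TRAIN_SEQUENCES = [f"{i:02d}" for i in range(9)]  # "00" .. "08"
TEST_SEQUENCES = ["09", "10"]

def split_for(sequence_id: str) -> str:
    """Return the split ("train" or "test") for a KITTI odometry sequence."""
    if sequence_id in TRAIN_SEQUENCES:
        return "train"
    if sequence_id in TEST_SEQUENCES:
        return "test"
    raise ValueError(f"sequence {sequence_id} is not part of the evaluated split")

assert split_for("00") == "train"
assert split_for("09") == "test"
```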
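
The hyperparameters quoted in the Experiment Setup row translate directly into a PyTorch training configuration. The sketch below is an assumption-laden illustration, not the authors' code: the two module definitions are placeholders for the paper's depth and pose networks, and the three loss terms are assumed to be already-computed tensors. Only the Adam optimizer, batch size 4, learning rate 10^-4, the (α, β, γ) weights, and the 200-epoch / 1000-batch schedule come from the quoted text.

```python
# Hedged sketch of the quoted training setup; only the hyperparameter
# values are from the paper, everything else is a placeholder.
import torch
import torch.nn as nn

BATCH_SIZE = 4
LEARNING_RATE = 1e-4                 # "learning rate to 10^-4"
EPOCHS = 200
BATCHES_PER_EPOCH = 1000             # 1000 randomly sampled batches per epoch
ALPHA, BETA, GAMMA = 1.0, 0.1, 0.5   # loss weights in the paper's Eqn. 1

# Placeholder modules standing in for the paper's depth and pose networks.
depth_net = nn.Conv2d(3, 1, kernel_size=3, padding=1)
pose_net = nn.Conv2d(6, 6, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(
    list(depth_net.parameters()) + list(pose_net.parameters()),
    lr=LEARNING_RATE,
)

def total_loss(l_photo: torch.Tensor,
               l_smooth: torch.Tensor,
               l_geom: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the three loss terms, as in the quoted Eqn. 1."""
    return ALPHA * l_photo + BETA * l_smooth + GAMMA * l_geom
```

Per the quoted setup, this configuration would be run twice in sequence: a 200-epoch pre-training pass on Cityscapes followed by a 200-epoch fine-tuning pass on KITTI, with validation after every epoch.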