Self-supervised surround-view depth estimation with volumetric feature fusion

Authors: Jung-Hee Kim, Junhwa Hur, Tien Phuoc Nguyen, Seong-Gyun Jeong

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a self-supervised depth estimation approach using a unified volumetric feature fusion for surround-view images. Our method outperforms the prior arts on DDAD and nuScenes datasets, especially estimating more accurate metric-scale depth and consistent depth between neighboring views.
Researcher Affiliation | Industry | Jung-Hee Kim, 42dot Inc. (junghee.kim@42dot.ai); Junhwa Hur, Google Research (junhwahur@google.com); Tien Phuoc Nguyen, Hyundai Motor Group Innovation Center (tien.nguyen@hmgics.com); Seong-Gyun Jeong, 42dot Inc. (seonggyun.jeong@42dot.ai)
Pseudocode | No | The paper describes the proposed architecture and methods in detail using text and diagrams (Figure 2, Figure 3), but it does not include a structured pseudocode block or an algorithm labeled as such.
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Sec. 4, and additional information is described in the supplementary material. We include our implementation in the supplementary material.
Open Datasets | Yes | We use the DDAD [14] and nuScenes [2] datasets for our experiments. Both datasets provide surround-view images from a total of 6 cameras mounted on a vehicle and LiDAR point clouds for the depth evaluation. See Sec. 4; as we use public research datasets, we’ve cited their works.
Dataset Splits | No | The paper states: "We train our model on each train split and report the accuracy on the test split." While it uses public datasets that typically have predefined splits, it does not explicitly mention or quantify a 'validation' dataset split or how it was derived, only 'train' and 'test'.
Hardware Specification | Yes | We implemented our networks in PyTorch [31] and trained on four A100 GPUs.
Software Dependencies | No | The paper mentions "PyTorch [31]" but does not specify a version number. It also mentions "ResNet-18" and "Adam optimizer" without associated version numbers for these software components or libraries.
Experiment Setup | Yes | During training, the input images are down-sampled to a resolution of 384 × 640 for the DDAD dataset, and to 352 × 640 for the nuScenes dataset. We train our model on the DDAD dataset for 20 epochs and the nuScenes dataset for 5 epochs. All experiments used the same training hyper-parameters (unless explicitly mentioned): Adam optimizer with β1 = 0.9 and β2 = 0.999; a mini-batch size of 2 per GPU and a learning rate [40] of 1 × 10⁻⁴, decaying at 3/4 of the entire training schedule with a factor of 0.1; ... For our volumetric feature, we used a voxel resolution of (1 m, 1 m, 0.75 m) with spatial dimensions of (100, 100, 20) for the (x, y, z) axes respectively. We use color jittering as data augmentation. For the depth synthesis loss, we use a random rotation with a range between [-5°, -5°, -25°] and [5°, 5°, 25°] for the depth map synthesis at a novel view. In the self-supervised loss in Eq. (2), we use depth smoothness weight λsmooth = 1 × 10⁻³, spatio loss weight λsp = 0.03, spatio-temporal weight λsp_t = 0.1, depth consistency weight λcons = 0.05, and depth smoothness weight at novel views λdepth_smooth = 0.03.
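To make the quoted setup easier to scan, the sketch below collects the reported hyper-parameters into a single training configuration, assuming a standard PyTorch Adam plus step-decay arrangement. The dictionary keys, the `build_optimizer` helper, the `StepLR` choice for the 3/4-schedule decay, and the placeholder model are illustrative assumptions, not code from the authors' release.

```python
import torch

# Hyper-parameters as reported in the paper's experiment setup
# (DDAD values; nuScenes uses a 352x640 input and 5 epochs instead).
CONFIG = {
    "input_resolution": (384, 640),     # DDAD; nuScenes: (352, 640)
    "epochs": 20,                       # DDAD; nuScenes: 5
    "batch_size_per_gpu": 2,            # trained on four A100 GPUs
    "lr": 1e-4,                         # decays by a factor of 0.1 at 3/4 of training
    "adam_betas": (0.9, 0.999),
    "voxel_size_m": (1.0, 1.0, 0.75),   # volumetric feature voxel resolution (x, y, z)
    "voxel_dims": (100, 100, 20),       # spatial dimensions (x, y, z)
    # Loss weights for the self-supervised loss in Eq. (2)
    "lambda_smooth": 1e-3,
    "lambda_sp": 0.03,
    "lambda_sp_t": 0.1,
    "lambda_cons": 0.05,
    "lambda_depth_smooth": 0.03,
}


def build_optimizer(model: torch.nn.Module, cfg: dict = CONFIG):
    """Adam with the reported betas and a step decay at 3/4 of the schedule (assumed StepLR)."""
    optimizer = torch.optim.Adam(
        model.parameters(), lr=cfg["lr"], betas=cfg["adam_betas"]
    )
    decay_epoch = int(cfg["epochs"] * 3 / 4)
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=decay_epoch, gamma=0.1
    )
    return optimizer, scheduler


if __name__ == "__main__":
    # Placeholder network standing in for the actual surround-view depth model.
    dummy_model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
    optimizer, scheduler = build_optimizer(dummy_model)
    print(optimizer, scheduler)
```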