Self-supervised surround-view depth estimation with volumetric feature fusion
Authors: Jung-Hee Kim, Junhwa Hur, Tien Phuoc Nguyen, Seong-Gyun Jeong
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a self-supervised depth estimation approach using a unified volumetric feature fusion for surround-view images. Our method outperforms the prior arts on DDAD and nuScenes datasets, especially estimating more accurate metric-scale depth and consistent depth between neighboring views. |
| Researcher Affiliation | Industry | Jung-Hee Kim, 42dot Inc., junghee.kim@42dot.ai; Junhwa Hur, Google Research, junhwahur@google.com; Tien Phuoc Nguyen, Hyundai Motor Group Innovation Center, tien.nguyen@hmgics.com; Seong-Gyun Jeong, 42dot Inc., seonggyun.jeong@42dot.ai |
| Pseudocode | No | The paper describes the proposed architecture and methods in detail using text and diagrams (Figure 2, Figure 3), but it does not include a structured pseudocode block or an algorithm labeled as such. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Sec. 4, and additional information is described in the supplementary material. We include our implementation in the supplementary material. |
| Open Datasets | Yes | We use the DDAD [14] and nuScenes [2] datasets for our experiments. Both datasets provide surround-view images from a total of 6 cameras mounted on a vehicle and LiDAR point clouds for the depth evaluation. See Sec. 4; as we use public research datasets, we have cited their works. |
| Dataset Splits | No | The paper states: "We train our model on each train split and report the accuracy on the test split." While it uses public datasets that typically have predefined splits, it does not explicitly describe a 'validation' split or how one was derived; only 'train' and 'test' splits are mentioned. |
| Hardware Specification | Yes | We implemented our networks in PyTorch [31] and trained on four A100 GPUs. |
| Software Dependencies | No | The paper mentions "PyTorch [31]" but does not specify a version number. It also mentions "ResNet-18" and the "Adam optimizer" without associated version numbers for these software components or libraries. |
| Experiment Setup | Yes | During training, the input images are down-sampled to a resolution of 384×640 for the DDAD dataset and 352×640 for the nuScenes dataset. We train our model on the DDAD dataset for 20 epochs and the nuScenes dataset for 5 epochs. All experiments used the same training hyper-parameters (unless explicitly mentioned): Adam optimizer with β1 = 0.9 and β2 = 0.999; a mini-batch size of 2 per GPU and a learning rate of 1×10⁻⁴ [40], decaying by a factor of 0.1 at 3/4 of the entire training schedule;... For our volumetric feature, we used a voxel resolution of (1 m, 1 m, 0.75 m) with spatial dimensions of (100, 100, 20) for the (x, y, z) axes, respectively. We use color jittering as data augmentation. For the depth synthesis loss, we use a random rotation with a range between [-5°, -5°, -25°] and [5°, 5°, 25°] for the depth map synthesis at a novel view. In the self-supervised loss in Eq. (2), we use depth smoothness weight λsmooth = 1×10⁻³, spatio loss weight λsp = 0.03, spatio-temporal weight λsp_t = 0.1, depth consistency weight λcons = 0.05, and depth smoothness weight at novel views λdepth_smooth = 0.03. (See the configuration sketch below the table.) |
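
The reported experiment setup can be collected into a single training configuration. Below is a minimal PyTorch sketch of that configuration, assuming the values quoted in the Experiment Setup row (optimizer betas, learning-rate schedule, batch size, voxel grid, and loss weights). The network is replaced by a tiny placeholder module; names such as `PlaceholderDepthNet` and the loss-weight dictionary keys are illustrative assumptions, not the authors' released code.

```python
"""Hedged sketch of the training configuration reported for the paper."""
import torch
import torch.nn as nn

# Reported input resolutions (height, width)
RESOLUTION_DDAD = (384, 640)
RESOLUTION_NUSCENES = (352, 640)

# Volumetric feature grid: voxel size in metres and grid cells per (x, y, z) axis
VOXEL_SIZE = (1.0, 1.0, 0.75)
VOXEL_DIMS = (100, 100, 20)

# Self-supervised loss weights from Eq. (2), as quoted in the table
LOSS_WEIGHTS = {
    "smooth": 1e-3,             # depth smoothness
    "spatio": 0.03,             # spatio loss
    "spatio_temporal": 0.1,     # spatio-temporal loss
    "consistency": 0.05,        # depth consistency
    "novel_view_smooth": 0.03,  # depth smoothness at novel views
}

EPOCHS = 20          # DDAD; 5 epochs for nuScenes
BATCH_PER_GPU = 2    # trained on four A100 GPUs
BASE_LR = 1e-4


class PlaceholderDepthNet(nn.Module):
    """Stand-in for the surround-view depth network (not the real architecture)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.sigmoid(self.conv(x))


model = PlaceholderDepthNet()
optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR, betas=(0.9, 0.999))
# Decay the learning rate by a factor of 0.1 at 3/4 of the training schedule.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.75 * EPOCHS)], gamma=0.1
)
```

This sketch only mirrors the hyper-parameters listed above; the color-jittering augmentation, the novel-view rotation sampling, and the actual loss computation are not reproduced here.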