Task-Aware Monocular Depth Estimation for 3D Object Detection

Authors: Xinlong Wang, Wei Yin, Tao Kong, Yuning Jiang, Lei Li, Chunhua Shen

AAAI 2020, pp. 12257-12264 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose the foreground-background separated monocular depth estimation (ForeSeE) method, to estimate the foreground and background depth using separate optimization objectives and decoders. Our method significantly improves the depth estimation performance on foreground objects. Applying ForeSeE to 3D object detection, we achieve 7.5 AP gains and set new state-of-the-art results among other monocular methods.
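The two-decoder design quoted above can be illustrated with a minimal PyTorch sketch. The module name `ForeSeEHead`, the channel sizes, the depth-as-classification output, and the pixel-wise max fusion rule are all illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the foreground/background separated decoder idea.
# Shared encoder features feed two decoders with separate objectives.
import torch
import torch.nn as nn

class ForeSeEHead(nn.Module):
    """Two depth decoders on shared features: one specialized for
    foreground objects, one for background (assumed structure)."""
    def __init__(self, in_channels: int = 256, depth_bins: int = 80):
        super().__init__()
        def decoder():
            return nn.Sequential(
                nn.Conv2d(in_channels, 128, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, depth_bins, 1),
            )
        self.fg_decoder = decoder()  # optimized mainly on foreground pixels
        self.bg_decoder = decoder()  # optimized mainly on background pixels

    def forward(self, feats: torch.Tensor):
        fg_logits = self.fg_decoder(feats)
        bg_logits = self.bg_decoder(feats)
        # One plausible fusion at inference: pixel-wise max over the two
        # predictions; the paper's exact fusion rule may differ.
        fused = torch.maximum(fg_logits, bg_logits)
        return fg_logits, bg_logits, fused
```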
Researcher Affiliation | Collaboration | The University of Adelaide, Australia; Bytedance AI Lab
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | Code will be available at: https://github.com/WXinlong/ForeSeE.
Open Datasets | Yes | The KITTI dataset (Geiger et al. 2013) has witnessed inspiring progress in the field of depth estimation. As most scenes in the KITTI-Raw data have limited foreground objects, we construct a new benchmark based on the KITTI-Object dataset. We collect the corresponding ground-truth depth map for each image in the KITTI-Object training set and term it the KITTI-Object-Depth (KOD) dataset. A total of 7,481 image-depth pairs are divided into training and testing subsets with 3,712 and 3,769 samples respectively (Chen et al. 2015), which ensures that images in the two subsets belong to different video clips. 2D bounding boxes are used to distinguish foreground and background pixels: pixels that fall within the foreground bounding boxes are designated as foreground, while all other pixels are assigned to the background.
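The foreground/background pixel assignment described in this row is straightforward to reproduce. Below is a small NumPy sketch, assuming boxes in (x1, y1, x2, y2) pixel coordinates; the function name and box format are illustrative, not taken from the paper.

```python
# Pixels inside any 2D ground-truth box -> foreground; the rest -> background.
import numpy as np

def fg_bg_mask(boxes, height, width):
    """Return a boolean mask that is True for foreground pixels."""
    mask = np.zeros((height, width), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        x1, y1 = max(int(x1), 0), max(int(y1), 0)
        x2, y2 = min(int(round(x2)), width), min(int(round(y2)), height)
        mask[y1:y2, x1:x2] = True
    return mask

# Example: two boxes on a 375x1242 (KITTI-sized) image.
mask = fg_bg_mask([(100, 150, 300, 320), (500, 160, 700, 330)], 375, 1242)
fg_pixels, bg_pixels = mask.sum(), (~mask).sum()
```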
Dataset Splits | No | "A total of 7,481 image-depth pairs are divided into training and testing subsets with 3,712 and 3,769 samples respectively (Chen et al. 2015)." The paper only specifies training and testing subsets; it does not mention a validation split.
Hardware Specification | No | "The Stochastic Gradient Descent (SGD) solver is adopted to optimize the network on a single GPU." The GPU model is not specified.
Software Dependencies | No | The paper mentions 'ImageNet pretrained ResNeXt-101' but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | For depth estimation, we follow most of the settings in the baseline method (Wei et al. 2019). The ImageNet-pretrained ResNeXt-101 (Xie et al. 2017) is used as the backbone model. We train the network for 20 epochs, with batch size 4 and base learning rate set to 0.001. The Stochastic Gradient Descent (SGD) solver is adopted to optimize the network on a single GPU. λf and λb in the foreground-background sensitive loss function are set to 0.2.
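The reported hyperparameters can be collected into a short training-configuration sketch. The weighted form of the foreground-background sensitive loss below, where each branch down-weights pixels outside its own region by λf or λb, is our reading of this row; the paper's exact formulation, as well as the SGD momentum, are assumptions.

```python
# Reported settings: SGD, single GPU, 20 epochs, batch size 4, base lr 0.001,
# lambda_f = lambda_b = 0.2. The loss form below is an assumed interpretation.
import torch

lambda_f, lambda_b = 0.2, 0.2

def fg_bg_sensitive_loss(pred_fg, pred_bg, target, fg_mask, base_loss):
    """base_loss must return a per-pixel loss map (reduction='none').
    Assumed form: the foreground branch keeps full weight on foreground
    pixels and down-weights background pixels by lambda_f; the background
    branch does the reverse with lambda_b."""
    fg_mask = fg_mask.float()
    w_fg = fg_mask + (1.0 - fg_mask) * lambda_f
    w_bg = (1.0 - fg_mask) + fg_mask * lambda_b
    return (w_fg * base_loss(pred_fg, target)).mean() \
         + (w_bg * base_loss(pred_bg, target)).mean()

# Optimizer as reported; momentum is unstated in the paper, so assumed here.
model = torch.nn.Conv2d(3, 1, 3, padding=1)  # placeholder network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```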