Training an Open-Vocabulary Monocular 3D Detection Model without 3D Data

Authors: Rui Huang, Henry Zheng, Yan Wang, Zhuofan Xia, Marco Pavone, Gao Huang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the superiority of OVM3D-Det over baselines in both indoor and outdoor scenarios. We first evaluate the proposed OVM3D-Det model on outdoor datasets, KITTI [14] and nuScenes [5], as well as indoor datasets, SUN RGB-D [52] and ARKitScenes [2] (Sec. 4.2). We then provide thorough ablation studies and detailed analysis to uncover the factors contributing to the effectiveness of OVM3D-Det (Sec. 4.3).
Researcher Affiliation | Collaboration | Rui Huang (1), Henry Zheng (1), Yan Wang (2), Zhuofan Xia (1), Marco Pavone (2,3), Gao Huang (1,4). (1) Department of Automation, BNRist, Tsinghua University, China; (2) NVIDIA Research, USA; (3) Stanford University, USA; (4) Beijing Academy of Artificial Intelligence, China
Pseudocode | No | The paper includes figures illustrating the framework and processes (e.g., Fig. 3, Fig. 4) but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code | No | The code will be released.
Open Datasets | Yes | We evaluate OVM3D-Det on the KITTI [14] and nuScenes [5] datasets for outdoor settings, and on the SUN RGB-D [52] and ARKitScenes [2] datasets for indoor settings.
Dataset Splits | Yes | KITTI [14] has 7,481 images for training and 7,518 images for testing. Since the official test set is unavailable, we follow [3] to resplit the training set into 3,321 training images, 391 validation images, and 3,769 test images. For nuScenes [5], we use 26,215 images for training, 1,915 images for validation, and 6,019 images for testing. SUN RGB-D [52] consists of a total of 10k samples, each annotated with oriented 3D bounding boxes, of which 4,929 samples are used for training, 356 for validation, and 5,050 for testing, following [3]. ARKitScenes [2] includes 48,046 images for training, 5,268 images for validation, and 7,610 images for testing. (These split sizes are collected in the configuration sketch after the table.)
Hardware Specification | Yes | We train on the KITTI dataset for 8 hours with 2 A100 GPUs, on nuScenes for 6 hours with 4 A100 GPUs, on SUN RGB-D for 12 hours with 2 A100 GPUs, and on ARKitScenes for 20 hours with 2 A100 GPUs.
Software Dependencies | No | The paper mentions software components like Grounded-SAM [47] and UniDepth [40] but does not provide specific version numbers for these or other key software dependencies.
Experiment Setup | Yes | When searching for the optimal box, we set λ to 5 for indoor scenes and 10 for outdoor scenes to balance the ray-tracing loss and the point-ratio loss. For the adaptive erosion module, in outdoor scenes, if the maximum width of the instance mask exceeds 10 pixels, image erosion is performed for 4 iterations; otherwise, for 2 iterations. For indoor scenes, the corresponding values are 12 and 2 iterations, respectively... We employ the SGD optimizer, with the learning rate decaying by a factor of 10 at 60% and 80% of the training process. During training, we apply random data augmentation techniques such as horizontal flipping and scaling within the range [0.50, 1.25]. The model is trained for 29,000 iterations with a batch size of 32 and a learning rate of 0.02 on both SUN RGB-D [52] and ARKitScenes [2], with a batch size of 16 and a learning rate of 0.01 on KITTI [14], and with a batch size of 32 and a learning rate of 0.01 on nuScenes [5]. (The box-search weighting, erosion rule, and optimizer schedule are illustrated in the sketches after the table.)
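
For reference, the split sizes from the Dataset Splits row can be gathered into a single configuration. A minimal sketch; the dictionary layout and key names are our own, not from the paper:

```python
# Split sizes (image/sample counts) as reported in the paper.
# Layout and key names are illustrative, not from the released code.
DATASET_SPLITS = {
    "KITTI":       {"train": 3321,  "val": 391,  "test": 3769},  # resplit per [3]
    "nuScenes":    {"train": 26215, "val": 1915, "test": 6019},
    "SUN RGB-D":   {"train": 4929,  "val": 356,  "test": 5050},  # split per [3]
    "ARKitScenes": {"train": 48046, "val": 5268, "test": 7610},
}
```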
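The box-search weighting and the adaptive erosion rule from the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the 3x3 kernel, the additive loss combination, which term λ scales, and the reading of "the values are 12 and 2 iterations" as the indoor iteration counts are all our assumptions.

```python
import cv2
import numpy as np

def box_search_score(ray_tracing_loss: float, point_ratio_loss: float,
                     indoor: bool) -> float:
    """Combine the two box-search losses with the lambda reported in the
    paper (5 for indoor scenes, 10 for outdoor). The additive form and
    which term lambda scales are assumptions; the paper only gives lambda."""
    lam = 5.0 if indoor else 10.0
    return ray_tracing_loss + lam * point_ratio_loss

def adaptive_erosion(mask: np.ndarray, indoor: bool) -> np.ndarray:
    """Erode a binary instance mask, picking the iteration count from its
    maximum row width. Outdoor: 4 iterations if the mask is wider than
    10 px, else 2; indoor: 12 and 2 (our reading of the paper's wording)."""
    max_width = int((mask > 0).sum(axis=1).max())   # widest row, in pixels
    wide_iters = 12 if indoor else 4
    iterations = wide_iters if max_width > 10 else 2
    kernel = np.ones((3, 3), np.uint8)              # kernel size is an assumption
    return cv2.erode(mask.astype(np.uint8), kernel, iterations=iterations)
```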
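The optimizer schedule maps directly onto PyTorch. A minimal sketch using the indoor-dataset settings (learning rate 0.02, batch size 32, 29,000 iterations) with a placeholder model, since the detector itself is not yet released; the augmentation pipeline (random flipping, scaling in [0.50, 1.25]) is omitted here:

```python
import torch
import torch.nn as nn

TOTAL_ITERS = 29_000                       # SUN RGB-D / ARKitScenes budget

# Placeholder module standing in for the (unreleased) OVM3D-Det detector.
model = nn.Linear(8, 1)

# SGD with the indoor-dataset settings (batch size 32, lr 0.02);
# KITTI uses lr 0.01 with batch 16, nuScenes lr 0.01 with batch 32.
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)

# Decay the learning rate by a factor of 10 at 60% and 80% of training.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[int(0.6 * TOTAL_ITERS), int(0.8 * TOTAL_ITERS)],
    gamma=0.1,
)

for it in range(TOTAL_ITERS):
    x = torch.randn(32, 8)                 # dummy batch of size 32
    loss = model(x).pow(2).mean()          # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                       # schedule steps once per iteration
```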