Training an Open-Vocabulary Monocular 3D Detection Model without 3D Data
Authors: Rui Huang, Henry Zheng, Yan Wang, Zhuofan Xia, Marco Pavone, Gao Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the superiority of OVM3D-Det over baselines in both indoor and outdoor scenarios. We first evaluate the proposed OVM3D-Det model on outdoor datasets, KITTI [14] and nuScenes [5], as well as indoor datasets, SUN RGB-D [52] and ARKitScenes [2] (Sec. 4.2). We then provide thorough ablation studies and detailed analysis to uncover the factors contributing to the effectiveness of OVM3D-Det (Sec. 4.3). |
| Researcher Affiliation | Collaboration | Rui Huang (1), Henry Zheng (1), Yan Wang (2), Zhuofan Xia (1), Marco Pavone (2,3), Gao Huang (1,4); (1) Department of Automation, BNRist, Tsinghua University, China; (2) NVIDIA Research, USA; (3) Stanford University, USA; (4) Beijing Academy of Artificial Intelligence, China |
| Pseudocode | No | The paper includes figures illustrating the framework and processes (e.g., Fig. 3, Fig. 4) but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | No | The code will be released. |
| Open Datasets | Yes | We evaluate OVM3D-Det on the KITTI [14] and nuScenes [5] datasets for outdoor settings, and on the SUN RGB-D [52] and ARKitScenes [2] datasets for indoor settings. |
| Dataset Splits | Yes | KITTI [14] has 7,481 images for training and 7,518 images for testing. Since the official test set is unavailable, we follow [3] to resplit the training set into 3,321 training images, 391 validation images, and 3,769 test images. For nuScenes [5], we use 26,215 images for training, 1,915 images for validation, and 6,019 images for testing. SUN RGB-D [52] consists of a total of 10k samples, each annotated with oriented 3D bounding boxes, of which 4,929 samples are used for training, 356 for validation, and 5,050 for testing, following [3]. ARKitScenes [2] includes 48,046 images for training, 5,268 images for validation, and 7,610 images for testing. |
| Hardware Specification | Yes | We train on the KITTI dataset for 8 hours with 2 A100 GPUs, on nuScenes for 6 hours with 4 A100 GPUs, on SUN RGB-D for 12 hours with 2 A100 GPUs, and on ARKitScenes for 20 hours with 2 A100 GPUs. |
| Software Dependencies | No | The paper mentions software components like Grounded-SAM [47] and UniDepth [40] but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | When searching for the optimal box, we set λ to 5 for indoor scenes and 10 for outdoor scenes to balance the ray tracing loss and point ratio loss. For the adaptive erosion module, for outdoor scenes, if the maximum width of the instance mask exceeds 10 pixels, image erosion is performed for 4 iterations; otherwise, 2 iterations. For indoor scenes, the values are 12 and 2 iterations, respectively... We employ the SGD optimizer, with the learning rate decaying by a factor of 10 at 60% and 80% of the training process. During training, we apply random data augmentation techniques such as horizontal flipping and scaling within the range of [0.50, 1.25]. The model is trained for 29,000 iterations with a batch size of 32 and a learning rate of 0.02 on both SUN RGB-D [52] and ARKitScenes [2], a batch size of 16 and a learning rate of 0.01 on KITTI [14], and a batch size of 32 and a learning rate of 0.01 on nuScenes [5]. (Illustrative sketches of the adaptive erosion rule and the learning-rate schedule follow below the table.) |
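
The adaptive erosion rule quoted in the Experiment Setup row is concrete enough to sketch. The snippet below is a minimal illustration, not the authors' released code (none is available): the function name `adaptive_erode`, the 3x3 kernel, and the reuse of the outdoor iteration counts indoors (the quote leaves the indoor counts ambiguous) are all assumptions.

```python
import numpy as np
import cv2  # OpenCV, assumed here for morphological erosion


def adaptive_erode(mask: np.ndarray, indoor: bool) -> np.ndarray:
    """Erode an instance mask, picking the iteration count from its width.

    Per the quoted setup: outdoor masks wider than 10 px are eroded for
    4 iterations, otherwise 2; indoors the width threshold is 12 px.
    """
    # Maximum width of the mask, taken as the widest row in pixels.
    max_width = int((mask > 0).sum(axis=1).max())

    threshold = 12 if indoor else 10
    iterations = 4 if max_width > threshold else 2  # indoor counts assumed

    kernel = np.ones((3, 3), np.uint8)  # kernel shape/size is an assumption
    return cv2.erode(mask.astype(np.uint8), kernel, iterations=iterations)
```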
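
The optimizer schedule is likewise reconstructible. Here is a minimal PyTorch sketch of the quoted recipe (SGD with the learning rate decayed by a factor of 10 at 60% and 80% of training), instantiated with the SUN RGB-D numbers; the linear stand-in model is hypothetical, and momentum/weight decay are omitted because the quote does not specify them.

```python
import torch

# Hypothetical stand-in; the actual OVM3D-Det detector is not released.
model = torch.nn.Linear(10, 10)

total_iters = 29_000  # SUN RGB-D / ARKitScenes setting from the quote
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)

# Decay the learning rate by a factor of 10 at 60% and 80% of training.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[int(0.6 * total_iters), int(0.8 * total_iters)],
    gamma=0.1,
)

for step in range(total_iters):
    # ... forward pass, loss computation, and loss.backward() go here ...
    optimizer.step()   # no-op placeholder until gradients are computed
    scheduler.step()   # advance the iteration-based decay schedule
```

With batch size 32 this matches the SUN RGB-D and ARKitScenes configuration; per the quote, KITTI and nuScenes would instead use a learning rate of 0.01 with their respective batch sizes.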