V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

Authors: Yichao Shen, Zigang Geng, Yuhui Yuan, Yutong Lin, Ze Liu, Chunyu Wang, Han Hu, Nanning Zheng, Baining Guo

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct thorough experiments to empirically show that our simple DETR-based approach significantly outperforms the previous state-of-the-art fully convolutional 3D detection methods, which helps to accelerate the convergence of the detection head architecture design for 2D and 3D detection tasks. We report the results of our approach on two challenging indoor 3D object detection benchmarks including ScanNet V2 and SUN RGB-D.
Researcher Affiliation | Collaboration | ¹National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University; ²University of Science and Technology of China; ³Microsoft Research Asia
Pseudocode | No | The paper describes the proposed method and its framework using text and diagrams, but it does not include any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for V-DETR, nor does it provide a link to a code repository.
Open Datasets | Yes | ScanNet V2 (Dai et al., 2017): ScanNet V2 consists of 3D meshes recovered from RGB-D videos captured in various indoor scenes. It has about 1.2K training meshes and 312 validation meshes, each annotated with semantic and instance segmentation masks for around 18 classes of objects. We follow Qi et al. (2019) to extract the point clouds from the meshes. SUN RGB-D (Song et al., 2015): SUN RGB-D is a single-view RGB-D image dataset. It has about 5K images for both training and validation sets. Each image is annotated with oriented 3D bounding boxes for 37 classes of objects.
Dataset Splits | Yes | ScanNet V2 (Dai et al., 2017): ...It has about 1.2K training meshes and 312 validation meshes... SUN RGB-D (Song et al., 2015): ...It has about 5K images for both training and validation sets.
Hardware Specification | Yes | We evaluate all numbers on a Tesla V100 PCIe 16 GB GPU with batch size as 1 for a fair comparison.
Software Dependencies | No | The paper mentions software components and frameworks like the 'AdamW optimizer', 'DETR framework', and 'Transformer decoder', but it does not specify exact version numbers for these or other software dependencies (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2019) with the base learning rate 7e-4, the batch size 8, and the weight decay 0.1. The learning rate is warmed up for 9 epochs, then decayed to 1e-6 with a cosine schedule over the rest of training. We use gradient clipping to stabilize the training. We train for 360 epochs on ScanNet V2 and 240 epochs on SUN RGB-D in all experiments except for the system-level comparisons, where we train for 540 epochs on ScanNet V2.
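
For concreteness, the optimization recipe quoted above can be sketched in PyTorch. This is a minimal sketch under stated assumptions only: the linear warmup shape, the gradient-clipping threshold, and the per-epoch scheduling are illustrative choices, since the paper reports the hyperparameters but releases no code. The default total of 360 epochs matches the ScanNet V2 setting.

# Minimal PyTorch sketch of the reported recipe: AdamW (lr 7e-4,
# weight decay 0.1), 9-epoch warmup, cosine decay to 1e-6, and
# gradient clipping. Warmup shape and clip threshold are assumptions.
import math
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    return torch.optim.AdamW(model.parameters(), lr=7e-4, weight_decay=0.1)

def lr_at_epoch(epoch: int, total_epochs: int = 360, warmup_epochs: int = 9,
                base_lr: float = 7e-4, final_lr: float = 1e-6) -> float:
    """Linear warmup, then cosine decay from base_lr down to final_lr."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))

# Training-loop skeleton (model, data, and loss are placeholders):
# optimizer = build_optimizer(model)
# for epoch in range(360):
#     for group in optimizer.param_groups:
#         group["lr"] = lr_at_epoch(epoch)
#     for batch in loader:  # batch size 8 per the paper
#         loss = compute_loss(model, batch)
#         loss.backward()
#         # clip threshold not reported; 0.1 is a placeholder value
#         torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
#         optimizer.step()
#         optimizer.zero_grad()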