Unifying Voxel-based Representation with Transformer for 3D Object Detection

Authors: Yanwei Li, Yilun Chen, Xiaojuan Qi, Zeming Li, Jian Sun, Jiaya Jia

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed method achieves leading performance in the nuScenes test set for both object detection and the following object tracking task. Extensive empirical studies are conducted in Section 4 to reveal the effect of each component.
Researcher Affiliation | Collaboration | Yanwei Li1, Yilun Chen1, Xiaojuan Qi2, Zeming Li3, Jian Sun3, Jiaya Jia1,4; The Chinese University of Hong Kong1, The University of Hong Kong2, MEGVII Technology3, SmartMore4
Pseudocode | No | The paper describes the architecture and processes (e.g.,
Open Source Code | Yes | Code is made publicly available at https://github.com/dvlab-research/UVTR.
Open Datasets | Yes | The nuScenes [43] dataset is a large-scale benchmark for autonomous driving, which is widely adopted for single- or multi-modality 3D object detection.
Dataset Splits | Yes | It contains 700, 150, 150 scenes in the train, val, and test set, respectively. Here, ablation studies are optimized on a mini 1/4 train split by default, and final models are optimized on the whole train set.
Hardware Specification | Yes | Our experiments in this paper require about 8 NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The paper mentions training with
Experiment Setup | Yes | Constructed voxel spaces VI, VP, and VU share the same shape 128 × 128 × Z... The channel number C in voxel spaces and transformer decoder is set to 256. And the amount of block M in the decoder is set to 3, 6, and 6 for LiDAR-based, camera-based, and fusion settings, respectively... The framework is trained with the AdamW optimizer with an initial learning rate 2e-5 for 20 epochs. For a camera-based setting, the network is optimized with an initial learning rate 1e-4 for 24 epochs. As for fusion, we initialize two modality-specific branches with corresponding pretrained models and optimize the model with an initial learning rate 4e-5 for 20 epochs.
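The per-modality training recipe quoted in the Experiment Setup row can be summarized as a small config sketch. The dictionary and helper below are illustrative only (names such as `TRAIN_CONFIGS` and `get_config` are not from the released UVTR code); they simply restate the learning rates, epoch counts, and decoder depths reported above, assuming AdamW is used in all three settings.

```python
# Illustrative sketch of the per-modality training settings reported in the
# paper: AdamW optimizer, modality-specific learning rate, epochs, and the
# number of transformer decoder blocks M. Names here are hypothetical.
TRAIN_CONFIGS = {
    "lidar":  {"optimizer": "AdamW", "lr": 2e-5, "epochs": 20, "decoder_blocks": 3},
    "camera": {"optimizer": "AdamW", "lr": 1e-4, "epochs": 24, "decoder_blocks": 6},
    "fusion": {"optimizer": "AdamW", "lr": 4e-5, "epochs": 20, "decoder_blocks": 6},
}

# Channel number C shared by the voxel spaces and the transformer decoder.
VOXEL_CHANNELS = 256


def get_config(modality: str) -> dict:
    """Return the training recipe for one of: 'lidar', 'camera', 'fusion'."""
    return TRAIN_CONFIGS[modality]
```

Note that per the paper, the fusion setting additionally initializes the two modality-specific branches from their pretrained single-modality models before fine-tuning at the lower learning rate.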