Unifying Voxel-based Representation with Transformer for 3D Object Detection
Authors: Yanwei Li, Yilun Chen, Xiaojuan Qi, Zeming Li, Jian Sun, Jiaya Jia
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed method achieves leading performance in the nuScenes test set for both object detection and the following object tracking task. Extensive empirical studies are conducted in Section 4 to reveal the effect of each component. |
| Researcher Affiliation | Collaboration | Yanwei Li¹, Yilun Chen¹, Xiaojuan Qi², Zeming Li³, Jian Sun³, Jiaya Jia¹˒⁴ — ¹The Chinese University of Hong Kong, ²The University of Hong Kong, ³MEGVII Technology, ⁴SmartMore |
| Pseudocode | No | The paper describes the architecture and processes (e.g., …) in prose and figures rather than as formal pseudocode. |
| Open Source Code | Yes | Code is made publicly available at https://github.com/dvlab-research/UVTR. |
| Open Datasets | Yes | nuScenes [43] dataset is a large-scale benchmark for autonomous driving, which is widely adopted for single- or multi-modality 3D object detection. |
| Dataset Splits | Yes | It contains 700, 150, 150 scenes in the train, val, and test set, respectively. Here, ablation studies are optimized on a mini 1/4 train split by default, and final models are optimized on the whole train set. |
| Hardware Specification | Yes | Our experiments in this paper require about 8 NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions training with …, but specific software dependencies and versions are not listed. |
| Experiment Setup | Yes | Constructed voxel spaces VI, VP, and VU share the same shape 128 × 128 × Z... The channel number C in voxel spaces and transformer decoder is set to 256. And the amount of block M in the decoder is set to 3, 6, and 6 for LiDAR-based, camera-based, and fusion settings, respectively... The framework is trained with AdamW optimizer with an initial learning rate 2e-5 for 20 epochs. For a camera-based setting, the network is optimized with an initial learning rate 1e-4 for 24 epochs. As for fusion, we initialize two modality-specific branches with corresponding pretrained models and optimize the model with an initial learning rate 4e-5 for 20 epochs. |
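The per-modality hyperparameters quoted in the experiment-setup row can be collected into a small lookup table. The sketch below is illustrative only: the dictionary layout and the helper name `get_schedule` are our own conventions, not taken from the released UVTR code; the values themselves come from the quote above.

```python
# Training schedules reported for UVTR, keyed by input modality.
# Values are transcribed from the paper's experiment-setup description;
# the structure and helper name are illustrative, not from the repo.
SCHEDULES = {
    "lidar":  {"optimizer": "AdamW", "lr": 2e-5, "epochs": 20, "decoder_blocks": 3},
    "camera": {"optimizer": "AdamW", "lr": 1e-4, "epochs": 24, "decoder_blocks": 6},
    "fusion": {"optimizer": "AdamW", "lr": 4e-5, "epochs": 20, "decoder_blocks": 6},
}

def get_schedule(modality: str) -> dict:
    """Return the reported training schedule for a given modality."""
    try:
        return SCHEDULES[modality]
    except KeyError:
        raise ValueError(f"unknown modality: {modality!r}") from None
```

For instance, `get_schedule("fusion")` reflects that the fusion model is fine-tuned from the two pretrained modality-specific branches at a lower learning rate than the camera-only model.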