PolarFormer: Multi-Camera 3D Object Detection with Polar Transformer

Authors: Yanqin Jiang, Li Zhang, Zhenwei Miao, Xiatian Zhu, Jin Gao, Weiming Hu, Yu-Gang Jiang

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Thorough experiments on the nuScenes dataset demonstrate that our PolarFormer significantly outperforms state-of-the-art 3D object detection alternatives. Extensive experiments on the nuScenes dataset show that our PolarFormer achieves leading performance for camera-based 3D object detection (Figure 1).
Researcher Affiliation | Collaboration | Yanqin Jiang (1,4), Li Zhang (2), Zhenwei Miao (5), Xiatian Zhu (6), Jin Gao (1,4), Weiming Hu (1,4,7), Yu-Gang Jiang (3). 1 NLPR, Institute of Automation, Chinese Academy of Sciences; 2 School of Data Science, Fudan University; 3 School of Computer Science, Fudan University; 4 School of Artificial Intelligence, University of Chinese Academy of Sciences; 5 Alibaba DAMO Academy; 6 Surrey Institute for People-Centred Artificial Intelligence, CVSSP, University of Surrey; 7 School of Information Science and Technology, ShanghaiTech University
Pseudocode | No | The paper describes its method using mathematical formulations and descriptive text, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'We implement our approach based on the codebase mmdetection3d (Contributors 2020). https://github.com/open-mmlab/mmdetection3d. Accessed: 2023-03-03.' This refers to a third-party codebase used for implementation, not the specific source code developed for PolarFormer by the authors of this paper.
Open Datasets | Yes | We evaluate PolarFormer on the nuScenes dataset (Caesar et al. 2020). It provides images at a resolution of 1600 × 900 from 6 surrounding cameras (Figure 1). The 1000 scenes in total, each roughly 20 seconds long and annotated every 0.5 seconds, are officially split into train/val/test sets with 700/150/150 scenes.
Dataset Splits | Yes | The 1000 scenes in total, each roughly 20 seconds long and annotated every 0.5 seconds, are officially split into train/val/test sets with 700/150/150 scenes. (These split sizes can be verified with the devkit sketch after the table.)
Hardware Specification | Yes | We train our models for 24 epochs with the AdamW optimizer and cosine annealing learning rate scheduler on 8 NVIDIA V100 GPUs.
Software Dependencies | No | The paper states: 'We implement our approach based on the codebase mmdetection3d (Contributors 2020).' While it mentions a codebase, it does not provide specific version numbers for mmdetection3d or for other key software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | The number of cross-plane encoder layers is set to 3 for each feature scale. The (radius, azimuth) resolutions of our multi-scale Polar BEV maps are (64, 256), (32, 128), and (16, 64), respectively. We use 6 Polar BEV encoder layers and 6 decoder layers. We train our models for 24 epochs with the AdamW optimizer and cosine annealing learning rate scheduler on 8 NVIDIA V100 GPUs. The initial learning rate is 2 × 10⁻⁴, and the weight decay is set to 0.075. Total batch size is set to 48 across six cameras. Synchronized batch normalization is adopted. All experiments use the original input resolution. (Illustrative sketches of this setup follow the table.)
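
The official 700/150/150 split quoted in the Dataset Splits row can be checked independently with the nuscenes-devkit. A minimal sketch, assuming the devkit is installed (pip install nuscenes-devkit); create_splits_scenes() returns the per-split scene names without requiring the dataset on disk:

```python
# Check the official nuScenes split sizes (700/150/150 scenes)
# using the nuscenes-devkit.
from nuscenes.utils.splits import create_splits_scenes

splits = create_splits_scenes()  # dict: split name -> list of scene names
print(len(splits["train"]), len(splits["val"]), len(splits["test"]))
# Expected output: 700 150 150
```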
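The training recipe in the Experiment Setup row maps onto a standard PyTorch optimizer/scheduler pairing. The following is a minimal sketch, not the authors' configuration: the paper builds on mmdetection3d, which wires this up through its own config system, and the model and iteration count here are placeholders.

```python
import torch

# Sketch of the quoted recipe: AdamW, initial lr 2 × 10⁻⁴, weight decay 0.075,
# cosine annealing over 24 epochs. `model` and `iters_per_epoch` are placeholders.
model = torch.nn.Linear(256, 10)  # stand-in for the PolarFormer network
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.075)

epochs, iters_per_epoch = 24, 1000  # iters_per_epoch is an assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * iters_per_epoch
)

for step in range(epochs * iters_per_epoch):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()  # decay the learning rate along a cosine schedule
```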
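Likewise, the quoted multi-scale Polar BEV resolutions of (64, 256), (32, 128), and (16, 64) in (radius, azimuth) imply coordinate grids of the following shape. This is an illustrative numpy sketch only; the maximum perception radius (60 m here) is an assumption, not a value stated in the excerpt above.

```python
import numpy as np

MAX_RADIUS = 60.0  # assumed perception range in metres (not stated above)

for n_r, n_a in [(64, 256), (32, 128), (16, 64)]:
    radius = np.linspace(0.0, MAX_RADIUS, n_r)                 # radial bin centers
    azimuth = np.linspace(-np.pi, np.pi, n_a, endpoint=False)  # azimuth bin centers
    r, a = np.meshgrid(radius, azimuth, indexing="ij")
    x, y = r * np.cos(a), r * np.sin(a)  # Cartesian position of each BEV cell
    print(r.shape)  # (n_r, n_a): one polar BEV map per feature scale
```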