DiffBEV: Conditional Diffusion Model for Bird’s Eye View Perception

Authors: Jiayu Zou, Kun Tian, Zheng Zhu, Yun Ye, Xingang Wang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | DiffBEV achieves a 25.9% mIoU on the nuScenes dataset, which is 6.2% higher than the best-performing existing approach. Quantitative and qualitative results on multiple benchmarks demonstrate that DiffBEV achieves state-of-the-art performance in BEV semantic segmentation and 3D object detection.
Researcher Affiliation | Collaboration | Jiayu Zou (1,3), Kun Tian (1,3), Zheng Zhu (2), Yun Ye (2), Xingang Wang (1*); 1: Institute of Automation, Chinese Academy of Sciences; 2: PhiGent Robotics; 3: University of Chinese Academy of Sciences
Pseudocode | No | The paper describes its methods in text and equations but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the described methodology.
Open Datasets | Yes | We compare the performance of DiffBEV with existing methods on four benchmarks, i.e., nuScenes (Caesar et al. 2020), KITTI Raw (Geiger, Lenz, and Urtasun 2012), KITTI Odometry (Behley et al. 2019), and KITTI 3D Object (Geiger, Lenz, and Urtasun 2012). (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | As illustrated in Tab. 1, the paper reports the segmentation performance of DiffBEV and several advanced methods; the previous state-of-the-art method LSS (Philion and Fidler 2020) is good at predicting static objects with wide coverage, such as the drivable area, walkway, and pedestrian crossing, compared to the car, pedestrian, bicycle, etc. Table 1 reports Intersection over Union scores (%) for hybrid scene layout estimation on the nuScenes val split.
Hardware Specification | Yes | Two NVIDIA GeForce RTX 3090 GPUs are utilized, and the mini-batch per GPU is set to 4 images.
Software Dependencies | No | The paper mentions software components such as the AdamW optimizer and Swin Transformer but does not provide specific version numbers for reproducibility.
Experiment Setup | Yes | All semantic segmentation models are trained with the AdamW optimizer (Ilya and Frank 2017), with the learning rate and weight decay set to 2e-4 and 0.01. Two NVIDIA GeForce RTX 3090 GPUs are utilized, and the mini-batch per GPU is set to 4 images. The input resolution is 800×600 for nuScenes and 1024×1024 for the KITTI datasets. The total training schedule comprises 20,000 iterations (200,000 iterations for nuScenes); a warm-up strategy (Goyal et al. 2017) gradually increases the learning rate over the first 1,500 iterations, after which a cyclic policy (Yan, Mao, and Li 2018) linearly decreases the learning rate from 2e-4 to 0 for the remainder of training. (A schedule sketch follows below.)
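
To make the Experiment Setup row concrete, here is a minimal PyTorch sketch of the reported optimizer and learning-rate schedule. This is an illustrative reconstruction, not the authors' released code: the model is a placeholder, and the cyclic policy of Yan, Mao, and Li (2018) is approximated as a single warm-up-plus-linear-decay cycle, matching the description of the rate falling linearly from 2e-4 to 0 after the first 1,500 iterations.

```python
# Hedged sketch of the reported training schedule: AdamW (lr 2e-4, weight
# decay 0.01), 1,500-iteration linear warm-up, then linear decay to 0.
# The tiny linear model is a stand-in for the DiffBEV network.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(256, 256)          # placeholder network
base_lr, weight_decay = 2e-4, 0.01
warmup_iters, total_iters = 1_500, 20_000  # paper uses 200,000 iters for nuScenes

optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)

def lr_lambda(it: int) -> float:
    """Linear warm-up to base_lr, then linear decay from base_lr to 0."""
    if it < warmup_iters:
        return (it + 1) / warmup_iters
    return max(0.0, (total_iters - it) / (total_iters - warmup_iters))

scheduler = LambdaLR(optimizer, lr_lambda)

for it in range(total_iters):
    # Dummy loss standing in for the DiffBEV segmentation/detection losses.
    loss = model(torch.randn(4, 256)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```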
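
Similarly, the Open Datasets row can be exercised with the official nuscenes-devkit. The snippet below is a hedged sketch: the paper does not state its data-loading tooling, and the dataroot path is a placeholder.

```python
# Minimal nuScenes access check using the official nuscenes-devkit
# (pip install nuscenes-devkit); the dataroot path is a placeholder.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-trainval', dataroot='/data/nuscenes', verbose=True)

# Walk from a scene to its first keyframe and its front-camera image.
scene = nusc.scene[0]
sample = nusc.get('sample', scene['first_sample_token'])
cam_front = nusc.get('sample_data', sample['data']['CAM_FRONT'])
print(scene['name'], cam_front['filename'])
```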