RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

Authors: Haiming Zhang, Xu Yan, Dongfeng Bai, Jiantao Gao, Pan Wang, Bingbing Liu, Shuguang Cui, Zhen Li

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results on the nuScenes dataset demonstrate the effectiveness of our proposed method in improving various 3D occupancy prediction approaches, e.g., our proposed methodology enhances our baseline by 2.2% in the metric of mIoU and achieves 50% in Occ3D benchmark." From Experimental Settings: "Implementation. For the dense prediction, we follow the setting of BEVDet (Huang et al. 2021) and use Swin Transformer (Liu et al. 2021) as the image backbone."
Researcher Affiliation | Collaboration | Haiming Zhang 1,2*, Xu Yan 3, Dongfeng Bai 3, Jiantao Gao 3, Pan Wang 3, Bingbing Liu 3, Shuguang Cui 2,1, Zhen Li 2,1. 1 FNii, CUHK-Shenzhen, Shenzhen, China; 2 SSE, CUHK-Shenzhen, Shenzhen, China; 3 Huawei Noah's Ark Lab. {haimingzhang@link., xuyan1@link., lizhen@}cuhk.edu.cn
Pseudocode | No | The paper describes its methodology in narrative form and does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides neither an explicit open-sourcing statement nor a direct link to code for the described methodology.
Open Datasets | Yes | "We evaluate our proposed method on nuScenes (Caesar et al. 2020) for sparse prediction and Occ3D (Tian et al. 2023) for dense prediction."
Dataset Splits | Yes | "The upper part of Table 1 presents the validation set results, where all methods are trained for 24 epochs."
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory); it mentions only the image backbone architecture.
Software Dependencies | No | The paper mentions various models and architectures used (e.g., BEVDet, Swin Transformer, ResNet101-DCN) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) needed for replication.
Experiment Setup | Yes | "For the dense prediction, we follow the setting of BEVDet (Huang et al. 2021) and use Swin Transformer (Liu et al. 2021) as the image backbone. We adopt the semantic scene completion module proposed in (Yan et al. 2021) as our occupancy decoder... Given the challenging nature of the Occ3D test benchmark, we utilize 8 historical frames for temporal encoding and use 3 frames on the validation set. For the sparse prediction, we use the previous art TPVFormer (Huang et al. 2023) as our baseline. The rendering size of the network is configured to 384 × 704. To speed up the rendering and reduce memory usage, we randomly sample 80,000 rays during each step."
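
The quoted ray sub-sampling step is the part of the setup most amenable to a quick prototype. Below is a minimal PyTorch sketch of randomly sampling 80,000 rays from one camera's 384 × 704 render grid. The function sample_rays and the tensor layout are hypothetical illustrations for this report, not the authors' implementation (the paper releases no code, per the Open Source Code row above).

import torch

def sample_rays(ray_origins, ray_dirs, num_samples=80_000):
    # Randomly keep a subset of rays to speed up rendering and reduce
    # memory usage, as the quoted setup describes.
    num_rays = ray_origins.shape[0]  # e.g. 384 * 704 = 270,336 rays per view
    idx = torch.randperm(num_rays)[:num_samples]
    return ray_origins[idx], ray_dirs[idx]

# Example: rays for one 384 x 704 camera view, flattened to (H*W, 3).
H, W = 384, 704
origins = torch.zeros(H * W, 3)                # placeholder ray origins
dirs = torch.randn(H * W, 3)
dirs = dirs / dirs.norm(dim=-1, keepdim=True)  # normalize to unit directions
sub_origins, sub_dirs = sample_rays(origins, dirs)
print(sub_origins.shape)                       # torch.Size([80000, 3])

Sampling without replacement via torch.randperm keeps each step's ray set diverse, and resampling at every training step means the full image grid is eventually covered over the course of training.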