DeepInteraction: 3D Object Detection via Modality Interaction
Authors: Zeyu Yang, Jiaqi Chen, Zhenwei Miao, Wei Li, Xiatian Zhu, Li Zhang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the large-scale nuScenes dataset show that our proposed method surpasses all prior arts, often by a large margin. |
| Researcher Affiliation | Collaboration | 1Fudan University 2Alibaba DAMO Academy 3S-Lab, NTU 4University of Surrey |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/fudan-zvg/DeepInteraction |
| Open Datasets | Yes | We evaluate our approach on the nuScenes dataset [3]. It provides point clouds from a 32-beam LiDAR and images with a resolution of 1600×900 from 6 surrounding cameras. The total of 1000 scenes, where each sequence is roughly 20 seconds long and annotated every 0.5 second, is officially split into train/val/test sets with 700/150/150 scenes. |
| Dataset Splits | Yes | The total of 1000 scenes, where each sequence is roughly 20 seconds long and annotated every 0.5 second, is officially split into train/val/test sets with 700/150/150 scenes. |
| Hardware Specification | Yes | Our LiDAR-only baseline is trained for 20 epochs and LiDAR-image fusion for 6 epochs with a batch size of 16 using 8 NVIDIA V100 GPUs. We compare inference speed tested on NVIDIA V100, A6000, and A100 GPUs separately. |
| Software Dependencies | No | The paper mentions software like mmdetection3d, ResNet-50, and Cascade Mask R-CNN but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For the image branch backbone, we use a simple ResNet-50 [16] and initialize it from the instance segmentation model Cascade Mask R-CNN [4] pretrained on COCO [27] and then nuImages [3], which is the same as TransFusion [1]. To save computation cost, we rescale the input image to 1/2 of its original size before feeding it into the network, and freeze the weights of the image branch during training. The voxel size is set to (0.075m, 0.075m, 0.2m), and the detection range is set to [−54m, 54m] for the X and Y axes and [−5m, 3m] for the Z axis. The representational interaction encoder is composed by stacking two representational interaction layers. For the multi-modal predictive interaction decoder, we use 5 cascaded decoder layers. We set the query number to 200 for training and testing and use the same query initialization method as TransFusion [1]. We use the Adam optimizer with a one-cycle learning rate policy, with max learning rate 1×10⁻³, weight decay 0.01 and momentum 0.85 to 0.95, following CBGS [48]. Our LiDAR-only baseline is trained for 20 epochs and LiDAR-image fusion for 6 epochs with a batch size of 16 using 8 NVIDIA V100 GPUs. |
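The hyperparameters quoted above can be collected into a single configuration sketch. This is a minimal, hypothetical Python dict in the style of an mmdetection3d config; all key names here are illustrative assumptions, not the authors' actual config schema, but the values match the numbers reported in the table.

```python
# Hypothetical config sketch (key names are assumptions; values are from the paper).
deepinteraction_cfg = dict(
    voxel_size=(0.075, 0.075, 0.2),            # metres, (x, y, z)
    detection_range=dict(xy=(-54.0, 54.0),     # metres along X and Y
                         z=(-5.0, 3.0)),       # metres along Z
    encoder=dict(num_interaction_layers=2),    # representational interaction layers
    decoder=dict(num_layers=5),                # cascaded predictive interaction layers
    num_queries=200,                           # same count for training and testing
    optimizer=dict(
        type='Adam',
        lr=1e-3,                               # max LR under the one-cycle policy
        weight_decay=0.01,
        momentum=(0.85, 0.95),
    ),
    schedule=dict(policy='one-cycle'),
    train=dict(
        lidar_only_epochs=20,
        fusion_epochs=6,
        batch_size=16,                         # total, across 8 V100 GPUs
    ),
)

# Derived sanity check: BEV grid resolution implied by range and voxel size.
x_min, x_max = deepinteraction_cfg['detection_range']['xy']
vx = deepinteraction_cfg['voxel_size'][0]
grid_x = int(round((x_max - x_min) / vx))
print(grid_x)  # 108 m span / 0.075 m voxels = 1440 voxels along X
```

The derived 1440×1440 BEV grid is a consequence of the stated range and voxel size, not a number quoted in the paper excerpt.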