3DET-Mamba: Causal Sequence Modelling for End-to-End 3D Object Detection
Authors: Mingsheng Li, Jiakang Yuan, Sijin Chen, Lin Zhang, Anyu Zhu, Xin Chen, Tao Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that 3DET-Mamba surpasses previous 3DETR on indoor 3D detection benchmarks such as ScanNet, improving AP@0.25/AP@0.50 from 65.0%/47.0% to 70.4%/54.4%, respectively. |
| Researcher Affiliation | Collaboration | Mingsheng Li1, Jiakang Yuan1, Sijin Chen1,2, Lin Zhang1, Anyu Zhu1, Xin Chen2, Tao Chen1; 1 Fudan University, 2 Tencent PCG (equal-contribution and corresponding-author markers appear in the original). |
| Pseudocode | Yes | Algorithm 1 Dual Mamba Block; Algorithm 2 Query-aware Mamba Block |
| Open Source Code | No | Data and code are not included in the submission due to the time limit; we will release our code in the future. |
| Open Datasets | Yes | Following previous works on 3D indoor object detection, we evaluate our models on two challenging benchmarks: SUN RGB-D [39] and ScanNet [7]. |
| Dataset Splits | Yes | The SUN RGB-D [39] dataset consists of 10,335 single-view RGB-D scans, with 5,285 used for training and 5,050 for validation. Each sample is annotated with rotated 3D bounding boxes. ScanNet [7] comprises 1,201 training samples and 312 validation samples, with each sample annotated with axis-aligned bounding box labels for 18 object categories. |
| Hardware Specification | Yes | The whole model is implemented in PyTorch, and all experiments are conducted on 8 NVIDIA 3090 GPUs (24 GB) with a total batch size of 64. |
| Software Dependencies | No | The paper mentions "implemented in PyTorch" but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | The input to our model is a point cloud P ∈ R^(N×3) representing a 3D scene, with N set as 20,000 for SUN RGB-D [39] and 40,000 for ScanNet [7]. We employ a single-layer inner mamba block that generates 2048 patches, each with 256-dimensional features. The dual mamba encoder has 3 layers and outputs scene features with a hidden dimension of 256. The decoder has 8 layers and is closely followed by MLPs as the bounding box prediction head. During training, we employ standard data augmentation methods, including random cropping, sampling, and flipping. We use the AdamW optimizer with a base learning rate of 7×10⁻⁴, decayed to 10⁻⁶ using a cosine schedule, and a weight decay of 0.1. Gradient clipping with an ℓ2 norm of 0.1 is applied to stabilize training. |
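The cosine learning-rate schedule quoted in the setup row (base rate 7×10⁻⁴ decayed to 10⁻⁶) can be sketched as below. This is a minimal illustration of the standard cosine-annealing formula, not code from the paper; the function name and step counts are our own.

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 7e-4, min_lr: float = 1e-6) -> float:
    """Cosine-annealed learning rate: starts at base_lr and decays
    smoothly to min_lr over total_steps (values from the paper's setup)."""
    progress = step / total_steps  # fraction of training completed
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# The schedule starts at the base rate and ends at the minimum rate.
print(cosine_lr(0, 1000))     # 7e-4 at step 0
print(cosine_lr(1000, 1000))  # 1e-6 at the final step
```

In PyTorch this behaviour corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` with `eta_min=1e-6` wrapped around an `AdamW` optimizer, with gradient clipping applied via `torch.nn.utils.clip_grad_norm_`.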