3DET-Mamba: Causal Sequence Modelling for End-to-End 3D Object Detection
Authors: Mingsheng Li, Jiakang Yuan, Sijin Chen, Lin Zhang, Anyu Zhu, Xin Chen, Tao Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that 3DET-Mamba surpasses previous 3DETR on indoor 3D detection benchmarks such as ScanNet, improving AP@0.25/AP@0.50 from 65.0%/47.0% to 70.4%/54.4%, respectively. |
| Researcher Affiliation | Collaboration | Mingsheng Li1, Jiakang Yuan1, Sijin Chen1,2, Lin Zhang1, Anyu Zhu1, Xin Chen2, Tao Chen1; 1 Fudan University, 2 Tencent PCG (equal-contribution and corresponding-author markers appear in the original). |
| Pseudocode | Yes | Algorithm 1 Dual Mamba Block; Algorithm 2 Query-aware Mamba Block |
| Open Source Code | No | Data and code are not included in the submission due to the time limit; we will release our code in the future. |
| Open Datasets | Yes | Following previous works on 3D indoor object detection, we evaluate our models on two challenging benchmarks: SUN RGB-D [39] and ScanNet [7]. |
| Dataset Splits | Yes | The SUN RGB-D [39] dataset consists of 10,335 single-view RGB-D scans, with 5,285 used for training and 5,050 for validation. Each sample is annotated with rotated 3D bounding boxes. ScanNet [7] comprises 1,201 training samples and 312 validation samples, with each sample annotated with axis-aligned bounding box labels for 18 object categories. |
| Hardware Specification | Yes | The whole model is implemented in PyTorch, and all experiments are conducted on 8 NVIDIA 3090 GPUs (24 GB) with a total batch size of 64. |
| Software Dependencies | No | The paper mentions "implemented in PyTorch" but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | The input to our model is a point cloud P ∈ R^(N×3) representing a 3D scene, with N set as 20,000 for SUN RGB-D [39] and 40,000 for ScanNet [7]. We employ a single-layer inner mamba block that generates 2048 patches, each with 256-dimensional features. The dual mamba encoder has 3 layers and outputs scene features with a hidden dimension of 256. The decoder has 8 layers and is closely followed by MLPs as the bounding box prediction head. During training, we employ standard data augmentation methods, including random cropping, sampling, and flipping. We use the AdamW optimizer with a base learning rate of 7×10⁻⁴, decayed to 10⁻⁶ using a cosine schedule, and a weight decay of 0.1. Gradient clipping with an ℓ2 norm of 0.1 is applied to stabilize training. |
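The cosine learning-rate schedule quoted in the setup row (base rate 7×10⁻⁴ decayed to 10⁻⁶) can be sketched as below. This is a minimal illustration of the standard cosine-annealing formula, not code from the paper; the function name and step counts are our own.

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 7e-4, min_lr: float = 1e-6) -> float:
    """Cosine-annealed learning rate: starts at base_lr and decays
    smoothly to min_lr over total_steps (values from the paper's setup)."""
    progress = step / total_steps  # fraction of training completed
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# The schedule starts at the base rate and ends at the minimum rate.
print(cosine_lr(0, 1000))     # 7e-4 at step 0
print(cosine_lr(1000, 1000))  # 1e-6 at the final step
```

In PyTorch this behaviour corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` with `eta_min=1e-6` wrapped around an `AdamW` optimizer, with gradient clipping applied via `torch.nn.utils.clip_grad_norm_`.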