MonoMAE: Enhancing Monocular 3D Detection through Depth-Aware Masked Autoencoders

Authors: Xueying Jiang, Sheng Jin, Xiaoqin Zhang, Ling Shao, Shijian Lu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments over KITTI 3D and nuScenes show that MonoMAE outperforms the state-of-the-art consistently and it can generalize to new domains as well.
Researcher Affiliation | Academia | Xueying Jiang (1), Sheng Jin (1), Xiaoqin Zhang (2), Ling Shao (3), Shijian Lu (1). (1) S-Lab, Nanyang Technological University, Singapore; (2) College of Computer Science and Technology, Zhejiang University of Technology, China; (3) UCAS-Terminus AI Lab, University of Chinese Academy of Sciences, China.
Pseudocode | No | The paper describes the method and its components in detail but does not provide explicit pseudocode or algorithm blocks.
Open Source Code | No | The used datasets are publicly available. We will consider releasing the code upon acceptance.
Open Datasets | Yes | KITTI 3D [13] comprises 7,481 training images and 7,518 testing images, with training-data labels publicly available and test-data labels stored on a test server for evaluation. nuScenes [3] comprises 1,000 video scenes.
Dataset Splits | Yes | Following [7], we divide the 7,481 training samples into a new train set with 3,712 images and a validation set with 3,769 images for ablation studies. The dataset [nuScenes] is split into a training set (700 scenes), a validation set (150 scenes), and a test set (150 scenes).
Hardware Specification | Yes | We conduct experiments on one NVIDIA V100 GPU and train the framework for 200 epochs with a batch size of 16 and a learning rate of 2 × 10^-4.
Software Dependencies | No | The paper mentions using AdamW, ResNet-50 as backbone, and a 3D detection head from [66], but does not provide specific version numbers for these software components or any other libraries.
Experiment Setup | Yes | We conduct experiments on one NVIDIA V100 GPU and train the framework for 200 epochs with a batch size of 16 and a learning rate of 2 × 10^-4. We use the AdamW [36] optimizer with weight decay 10^-4.
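The reported optimizer settings (AdamW with learning rate 2 × 10^-4 and weight decay 10^-4) can be illustrated with a minimal scalar AdamW update. This is a sketch of the generic AdamW rule under those hyperparameters, not the paper's code; the betas and epsilon are the standard AdamW defaults, assumed here since the paper does not state them.

```python
import math

# Settings reported in the paper: lr = 2e-4, weight decay = 1e-4.
# BETA1, BETA2, EPS are the usual AdamW defaults (assumed, not from the paper).
LR, WD, BETA1, BETA2, EPS = 2e-4, 1e-4, 0.9, 0.999, 1e-8

def adamw_step(param, grad, m, v, t):
    """One AdamW update for a scalar parameter (decoupled weight decay)."""
    m = BETA1 * m + (1 - BETA1) * grad          # first-moment (momentum) estimate
    v = BETA2 * v + (1 - BETA2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - BETA1 ** t)                # bias corrections for step t
    v_hat = v / (1 - BETA2 ** t)
    # Decoupled weight decay: the weight is shrunk directly rather than
    # folding WD * param into the gradient (the difference from Adam + L2).
    param -= LR * m_hat / (math.sqrt(v_hat) + EPS) + LR * WD * param
    return param, m, v

# Sanity check: minimise f(w) = (w - 3)^2 for a few steps.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adamw_step(w, grad, m, v, t)
```

With this small learning rate each step moves the weight by roughly LR, so after 1,000 steps the toy parameter has drifted only part of the way toward the minimum at 3, which is the expected behavior of Adam-family updates whose per-step magnitude is capped near the learning rate.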