Large Spatial Model: End-to-end Unposed Images to Semantic 3D

Authors: Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang "Atlas" Wang, Danfei Xu, Boris Ivanovic, Marco Pavone, Yue Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on various tasks demonstrate that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time.
Researcher Affiliation | Collaboration | Zhiwen Fan1,2, Jian Zhang3, Wenyan Cong1, Peihao Wang1, Renjie Li4, Kairun Wen3, Shijie Zhou5, Achuta Kadambi5, Zhangyang Wang1, Danfei Xu2,6, Boris Ivanovic2, Marco Pavone2,7, Yue Wang2,8 — 1UT Austin, 2NVIDIA Research, 3XMU, 4TAMU, 5UCLA, 6Georgia Tech, 7Stanford University, 8USC
Pseudocode | No | The paper describes the model architecture and procedures in text and diagrams (Figure 2) but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | "We will release our code after our paper gets accepted."
Open Datasets | Yes | "leveraging a combined dataset of ScanNet++ [60] and ScanNet [61]"
Dataset Splits | No | The paper describes training and testing splits ("we select one image out of four as test images, and the rest ones used as training") but does not explicitly detail a separate validation split.
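The one-in-four hold-out described in the paper can be sketched as a simple per-scene partition. This is an illustrative reading, not the released code; the indexing convention (which of every four images is held out) is an assumption.

```python
def split_scene(image_paths):
    """Hold out one image out of every four as test; the rest train.

    Assumes the held-out image is the last of each group of four,
    which is an illustrative choice -- the paper does not specify.
    """
    test = [p for i, p in enumerate(image_paths) if i % 4 == 3]
    train = [p for i, p in enumerate(image_paths) if i % 4 != 3]
    return train, test

# Example: 8 images per scene -> 6 train, 2 test
train, test = split_scene([f"img_{i:03d}.png" for i in range(8)])
```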
Hardware Specification | Yes | Training runs on 8 NVIDIA A100 GPUs and lasts 3 days.
Software Dependencies | No | The paper mentions several models and optimizers (e.g., ViT-Large, DPT head, DUSt3R, Point Transformer V3, LSeg, AdamW) but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Training runs for 100 epochs on a combined dataset of ScanNet++ [60] and ScanNet [61] comprising 1,565 scenes, on 8 NVIDIA A100 GPUs for 3 days. Training starts with a base learning rate of 1e-4 and includes a 10-epoch warm-up period. AdamW is the optimizer for all experiments. The parameters λ1, λ2, λ3 are set to 0.25, 0.3, and 1.5, respectively, as determined by grid search.
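The reported optimization hyperparameters can be expressed as a small schedule sketch. This assumes a linear warm-up and a constant rate afterwards; the paper's excerpt does not specify the post-warm-up schedule, and the loss-term names are placeholders.

```python
BASE_LR = 1e-4        # base learning rate from the paper
WARMUP_EPOCHS = 10    # warm-up period from the paper
TOTAL_EPOCHS = 100    # total training epochs from the paper

def learning_rate(epoch):
    """Linear warm-up to BASE_LR, then constant.

    The warm-up shape and the flat schedule after it are assumptions;
    only the base LR, warm-up length, and epoch count are reported.
    """
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    return BASE_LR

# Grid-searched loss weights from the paper; the loss terms they scale
# (e.g., depth, pose, semantic objectives) are not named in this excerpt.
lam1, lam2, lam3 = 0.25, 0.3, 1.5
```

With AdamW as the optimizer, `learning_rate(epoch)` would be applied per epoch, e.g. via a LambdaLR-style multiplier in PyTorch.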