Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Model
Authors: Yue Zhang, Zhiyang Xu, Ying Shen, Parisa Kordjamshidi, Lifu Huang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that both our proposed dataset and alignment module significantly enhance the situated spatial understanding of 3D-based LLMs. |
| Researcher Affiliation | Academia | 1Michigan State University 2Virginia Tech 3University of Illinois at Urbana-Champaign 4UC Davis EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the proposed methods and system architecture through textual descriptions and figures (e.g., Fig. 3, 5, 7), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/zhangyuejoslin/Spartun3D |
| Open Datasets | Yes | To address the aforementioned issues, we propose two key innovations: we first introduce a scalable, LLM-generated dataset named Spartun3D... The 3D scenes in Spartun3D are taken from 3RScan (Wu et al., 2021), which provides a diverse set of realistic 3D environments. ... SQA3D (Ma et al., 2022) introduces a human-annotated dataset where the model generates answers based on questions and given situations. |
| Dataset Splits | Yes | Table 1: Dataset statistics of Spartun3D and human validation results. Captioning: 10K examples (8,367 train / 1,350 test); Attr. & Rel.: 62K (61,254 / 8,168); Affordance: 40K (35,070 / 5,017); Planning: 21K (19,434 / 2,819). |
| Hardware Specification | Yes | The model is trained on 6 NVIDIA RTX A6000 GPUs for around 30 hours with 15 epochs. |
| Software Dependencies | No | The paper mentions several models and frameworks like PointNet++ (Qi et al., 2017), LEO (Huang et al., 2023), OPT1.3B (Zhang et al., 2023b), Vicuna7B (Chiang et al., 2023), and LoRA (Hu et al., 2021). However, it does not specify version numbers for any software dependencies or libraries used for implementation. |
| Experiment Setup | Yes | The maximum context length and output length of the LLM are both set to 256. For each 3D scene, we sample up to 60 objects with 1024 points per object. During training, the pre-trained 3D point cloud encoder and the LLM are frozen. We set rank and α in LoRA to 16 and the dropout rate to 0. During inference, we employ beam search to generate the textual response, and the number of beams is 5. The model is trained on 6 NVIDIA RTX A6000 GPUs for around 30 hours with 15 epochs. The learning rate is 3e-5, and the batch size is 24. |
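The per-task train/test counts reported in the Dataset Splits row can be sanity-checked with a few lines of arithmetic. This is an illustrative sketch, not part of the authors' release; the `splits` dictionary simply transcribes the Table 1 numbers quoted above:

```python
# Train/test example counts per task, as quoted from Table 1 of the paper.
splits = {
    "Captioning": (8_367, 1_350),
    "Attr. & Rel.": (61_254, 8_168),
    "Affordance": (35_070, 5_017),
    "Planning": (19_434, 2_819),
}

for task, (train, test) in splits.items():
    total = train + test
    print(f"{task}: {total} examples, {test / total:.1%} held out for test")
```

Running this shows the held-out fraction per task sits roughly in the 12–14% range across all four tasks.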
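The hyperparameters scattered through the Experiment Setup cell can be gathered into one configuration object for easier comparison when reproducing the run. This is a minimal sketch: the class and field names are illustrative (not from the authors' code), but each value is taken directly from the quoted setup:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Spartun3DTrainConfig:
    """Hyperparameters quoted in the paper's experiment setup (illustrative names)."""
    # LLM context/generation limits.
    max_context_length: int = 256
    max_output_length: int = 256
    # Scene sampling: up to 60 objects, 1024 points per object.
    max_objects_per_scene: int = 60
    points_per_object: int = 1024
    # LoRA adapter settings (3D point cloud encoder and LLM stay frozen).
    lora_rank: int = 16
    lora_alpha: int = 16
    lora_dropout: float = 0.0
    # Optimization.
    learning_rate: float = 3e-5
    batch_size: int = 24
    epochs: int = 15
    # Inference-time decoding.
    num_beams: int = 5

cfg = Spartun3DTrainConfig()
```

A frozen dataclass makes the configuration hashable and guards against accidental mutation between training and inference stages.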