Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Model

Authors: Yue Zhang, Zhiyang Xu, Ying Shen, Parisa Kordjamshidi, Lifu Huang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results demonstrate that both our proposed dataset and alignment module significantly enhance the situated spatial understanding of 3D-based LLMs."
Researcher Affiliation | Academia | "1Michigan State University, 2Virginia Tech, 3University of Illinois at Urbana-Champaign, 4UC Davis, EMAIL, EMAIL, EMAIL"
Pseudocode | No | "The paper describes the proposed methods and system architecture through textual descriptions and figures (e.g., Fig. 3, 5, 7), but it does not include any explicitly labeled pseudocode or algorithm blocks."
Open Source Code | Yes | https://github.com/zhangyuejoslin/Spartun3D
Open Datasets | Yes | "To address the aforementioned issues, we propose two key innovations: we first introduce a scalable, LLM-generated dataset named Spartun3D... The 3D scenes in Spartun3D are taken from 3RScan (Wu et al., 2021), which provides a diverse set of realistic 3D environments. ... SQA3D (Ma et al., 2022) introduces a human-annotated dataset where the model generates answers based on questions and given situations."
Dataset Splits | Yes | "Table 1: Dataset statistics of Spartun3D and human validation results." Per-task examples (train/test): Captioning 10K (8,367/1,350); Attr. & Rel. 62K (61,254/8,168); Affordance 40K (35,070/5,017); Planning 21K (19,434/2,819).
Hardware Specification | Yes | "The model is trained on 6 NVIDIA RTX A6000 GPUs for around 30 hours with 15 epochs."
Software Dependencies | No | "The paper mentions several models and frameworks like PointNet++ (Qi et al., 2017), LEO (Huang et al., 2023), OPT-1.3B (Zhang et al., 2023b), Vicuna-7B (Chiang et al., 2023), and LoRA (Hu et al., 2021). However, it does not specify version numbers for any software dependencies or libraries used for implementation."
Experiment Setup | Yes | "The maximum context length and output length of the LLM are both set to 256. For each 3D scene, we sample up to 60 objects with 1,024 points per object. During training, the pre-trained 3D point cloud encoder and the LLM are frozen. We set rank and α in LoRA to 16 and the dropout rate to 0. During inference, we employ beam search to generate the textual response, and the number of beams is 5. The model is trained on 6 NVIDIA RTX A6000 GPUs for around 30 hours with 15 epochs. The learning rate is 3e-5, and the batch size is 24."
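For readers who want the reported setup in machine-readable form, the hyperparameters quoted above can be collected into a single configuration object. This is an illustrative sketch only: the class name `SpartunTrainConfig` and its field names are hypothetical, not taken from the authors' released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpartunTrainConfig:
    """Hyperparameters as reported in the paper's experiment setup.

    Note: field names are hypothetical; only the values come from the paper.
    """
    # LLM context and output lengths (both set to 256)
    max_context_len: int = 256
    max_output_len: int = 256
    # Scene sampling: up to 60 objects, 1,024 points per object
    max_objects: int = 60
    points_per_object: int = 1024
    # LoRA settings: rank 16, alpha 16, dropout 0
    # (pre-trained 3D point cloud encoder and LLM are frozen during training)
    lora_rank: int = 16
    lora_alpha: int = 16
    lora_dropout: float = 0.0
    # Inference: beam search with 5 beams
    num_beams: int = 5
    # Optimization: lr 3e-5, batch size 24, 15 epochs on 6 RTX A6000 GPUs
    learning_rate: float = 3e-5
    batch_size: int = 24
    epochs: int = 15
    num_gpus: int = 6


cfg = SpartunTrainConfig()
```

A frozen dataclass keeps the reported values immutable, so they can be logged or compared against a reproduction run without risk of accidental modification.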