3D-LLM: Injecting the 3D World into Large Language Models

Authors: Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the held-out evaluation datasets ScanQA, SQA3D, and 3DMV-VQA outperform state-of-the-art baselines.
Researcher Affiliation | Collaboration | Yining Hong (University of California, Los Angeles); Haoyu Zhen (Shanghai Jiao Tong University); Peihao Chen (South China University of Technology); Shuhong Zheng (University of Illinois Urbana-Champaign); Yilun Du (Massachusetts Institute of Technology); Zhenfang Chen (MIT-IBM Watson AI Lab); Chuang Gan (UMass Amherst and MIT-IBM Watson AI Lab)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | "We release our 3D-LLMs, the 3D-language dataset, and language-aligned 3D features of the dataset for future research development." (Footnote: https://github.com/UMass-Foundation-Model/3D-LLM)
Open Datasets | Yes | "We mainly establish our 3D-language dataset upon several 3D assets: Objaverse, ScanNet [12], Habitat-Matterport (HM3D) [39], HM3DSem [44]." ... "We use three held-out 3D question answering datasets for held-out evaluation: ScanQA, SQA3D and 3DMV-VQA."
Dataset Splits | Yes | "We utilize training sets of held-in datasets for pre-training foundation 3D-LLMs, and their validation sets can be applied for held-in evaluation."
Hardware Specification | No | The paper acknowledges "computation support from AiMOS, a server cluster for the IBM Research AI Hardware Center" but does not provide specific hardware details such as GPU or CPU models.
Software Dependencies | No | The paper names the models it builds on (Flamingo 9B, BLIP-2 ViT-g OPT2.7B, BLIP-2 ViT-g FlanT5-XL), the LAVIS library [28], the OpenFlamingo repository [2], and the Mask2Former (M2F) [9] and Segment Anything (SAM) [26] segmentation models, but gives no version numbers for these or any other software dependencies. (A hedged checkpoint-loading sketch follows the table.)
Experiment Setup | Yes | "Architecture: We experiment on three backbone 2D VLMs for 3D-LLMs: Flamingo 9B, BLIP-2 ViT-g OPT2.7B, and BLIP-2 ViT-g FlanT5-XL. For BLIP-2, during pre-training of the 3D-LLMs we initialize the model from the BLIP-2 checkpoints released in the LAVIS library [28] and finetune the parameters of the Q-Former. 3D features are 1408-dim features, the same as the EVA-CLIP [41] hidden feature dim used by BLIP-2. We keep most parts of the LLMs (i.e., OPT and Flan-T5) frozen, except the weights for the newly added location tokens in the input and output embeddings. For Flamingo, we initialize the model from the Flamingo 9B checkpoint released in the OpenFlamingo repository [2]. We finetune the parameters of the perceiver, the gated cross-attention layers, and the weights for the additional location tokens in the input and output embeddings. 3D features are 1024-dim features, the same as the CLIP hidden feature dim used by Flamingo." (The selective-freezing scheme is sketched after this table.)
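
To make the dependency concern above concrete, here is a minimal loading sketch using the LAVIS model zoo. The name/model_type strings are an assumption about which zoo entries correspond to the "BLIP-2 ViT-g OPT2.7B" backbone; the paper does not state them, and unpinned LAVIS versions may load different weights.

    import torch
    from lavis.models import load_model_and_preprocess

    # Assumed zoo entry for the BLIP-2 ViT-g OPT2.7B backbone; use
    # name="blip2_t5", model_type="pretrain_flant5xl" for the Flan-T5 variant.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, vis_processors, txt_processors = load_model_and_preprocess(
        name="blip2_opt",
        model_type="pretrain_opt2.7b",
        is_eval=False,  # keep training mode, since the Q-Former is finetuned
        device=device,
    )

Pinning the salesforce-lavis package version in a requirements file would resolve the missing-dependency issue this row flags.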
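
The experiment-setup row describes a selective-freezing scheme: the LLM stays frozen except for the embedding rows of newly added location tokens, while the Q-Former (or, for Flamingo, the perceiver and gated cross-attention layers) is finetuned. Below is a minimal sketch of one way to realize the "frozen except new token rows" part with a gradient-masking hook; the helper name and the attribute names in the usage comments (opt_model, Qformer) are illustrative assumptions, not the authors' code.

    import torch

    def train_only_new_rows(weight: torch.Tensor, num_new: int) -> None:
        """Mask gradients so only the last `num_new` rows of a (vocab, dim)
        embedding/output-projection matrix are updated; newly added tokens
        are assumed to be appended at the end of the vocabulary."""
        old_rows = weight.shape[0] - num_new
        weight.requires_grad_(True)

        def mask_grad(grad: torch.Tensor) -> torch.Tensor:
            grad = grad.clone()
            grad[:old_rows] = 0.0  # original vocabulary rows stay frozen
            return grad            # location-token rows keep their gradient

        weight.register_hook(mask_grad)

    # Hypothetical usage with a LAVIS BLIP-2 model:
    # for p in model.opt_model.parameters():
    #     p.requires_grad_(False)                 # freeze the whole LLM ...
    # for p in model.Qformer.parameters():
    #     p.requires_grad_(True)                  # ... but finetune the Q-Former
    # model.opt_model.resize_token_embeddings(new_vocab_size)  # add location tokens
    # train_only_new_rows(model.opt_model.get_input_embeddings().weight, num_new)
    # train_only_new_rows(model.opt_model.get_output_embeddings().weight, num_new)

Note that if input and output embeddings are weight-tied (as in OPT), masking the shared weight once is enough.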