3D-LLM: Injecting the 3D World into Large Language Models
Authors: Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the held-out evaluation datasets ScanQA, SQA3D, and 3DMV-VQA outperform state-of-the-art baselines. |
| Researcher Affiliation | Collaboration | Yining Hong (University of California, Los Angeles); Haoyu Zhen (Shanghai Jiao Tong University); Peihao Chen (South China University of Technology); Shuhong Zheng (University of Illinois Urbana-Champaign); Yilun Du (Massachusetts Institute of Technology); Zhenfang Chen (MIT-IBM Watson AI Lab); Chuang Gan (UMass Amherst and MIT-IBM Watson AI Lab) |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our 3D-LLMs, the 3D-language dataset, and language-aligned 3D features of the dataset for future research development. (https://github.com/UMass-Foundation-Model/3D-LLM) |
| Open Datasets | Yes | We mainly establish our 3D-language dataset upon several 3D assets: Objaverse, ScanNet [12], Habitat-Matterport (HM3D) [39], HM3DSem [44]. ... We use three held-out 3D question answering datasets for held-out evaluation: ScanQA, SQA3D and 3DMV-VQA. |
| Dataset Splits | Yes | We utilize training sets of held-in datasets for pre-training foundation 3D-LLMs, and their validation sets can be applied for held-in evaluation. |
| Hardware Specification | No | The paper mentions "computation support from AiMOS, a server cluster for the IBM Research AI Hardware Center" but does not provide specific hardware details such as GPU or CPU models. |
| Software Dependencies | No | The paper names specific models (Flamingo 9B, BLIP-2 Vit-g Opt2.7B, BLIP-2 Vit-g Flan T5-XL, Mask2Former (M2F) [9], Segment Anything (SAM) [26]) and libraries (the LAVIS library [28], the Open Flamingo repository [2]), but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Architecture We experiment on three backbone 2D VLMs for 3D-LLMs: Flamingo 9B, BLIP-2 Vit-g Opt2.7B, BLIP-2 Vit-g Flan T5-XL. For BLIP-2, during pre-training the 3D-LLMs, we initialize the model from BLIP-2 checkpoints released in LAVIS library [28], and finetune the parameters for the QFormer. 3D features are 1408-dim features, same as EVA-CLIP [41] hidden feature dim used by BLIP-2. We keep most parts of the LLMs (i.e., Opt and Flan T5) frozen, except the weights for the newly-added location tokens in the input and the output embeddings. For Flamingo, we initialize the model from the Flamingo9B checkpoint released in Open Flamingo repository [2]. We finetune the parameters for perceiver, gated cross attention layers, and the weights for additional location tokens in the input and output embeddings. 3D features are 1024-dim features, same as CLIP hidden feature dim used by Flamingo. |
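The Experiment Setup row above spells out which parameters are trained and which stay frozen. Below is a minimal PyTorch sketch of that freezing scheme for the BLIP-2 backbone. It is not the authors' code: it uses the Hugging Face `transformers` checkpoint `Salesforce/blip2-opt-2.7b` as a stand-in for the LAVIS release, and the `<locN>` token format and the count of 256 location tokens are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the 3D-LLM parameter-freezing setup described in the Experiment Setup row.
# Assumptions (not from the paper): Hugging Face transformers BLIP-2 checkpoint as a
# stand-in for the LAVIS checkpoint; location tokens written as "<loc0>"..."<loc255>".
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# 1) Add location tokens to the vocabulary and resize the LLM embedding tables.
num_loc_tokens = 256  # hypothetical grid resolution, chosen only for illustration
location_tokens = [f"<loc{i}>" for i in range(num_loc_tokens)]
old_vocab_size = model.language_model.get_input_embeddings().weight.shape[0]
processor.tokenizer.add_tokens(location_tokens, special_tokens=True)
model.language_model.resize_token_embeddings(len(processor.tokenizer))

# 2) Freeze everything, then unfreeze the Q-Former (the part the paper fine-tunes).
for param in model.parameters():
    param.requires_grad = False
for param in model.qformer.parameters():
    param.requires_grad = True

# 3) The LLM stays frozen except the rows of the input/output embeddings that
# correspond to the newly added location tokens. One way to approximate that is
# a gradient mask that zeroes updates for the original vocabulary rows.
# (For OPT the input and output embeddings may be weight-tied, which is harmless here.)
for emb in (model.language_model.get_input_embeddings(),
            model.language_model.get_output_embeddings()):
    emb.weight.requires_grad = True

    def mask_old_rows(grad, old=old_vocab_size):
        grad = grad.clone()
        grad[:old] = 0  # keep the original LLM vocabulary frozen
        return grad

    emb.weight.register_hook(mask_old_rows)

# 4) The 3D features are 1408-dim, matching the EVA-CLIP hidden size that BLIP-2's
# Q-Former expects, so they can be fed in place of 2D image features without projection.
point_features = torch.randn(1, 257, 1408)  # (batch, tokens, dim) dummy input
```

For the Flamingo backbone, the analogous step would unfreeze the perceiver resampler and the gated cross-attention layers (for example by substring-matching `named_parameters()`), together with the same location-token embedding rows, and use 1024-dim 3D features to match the CLIP hidden size.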