Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Authors: Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liangyan Gui, Yu-Xiong Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this issue, we present the first comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image, video, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. |
| Researcher Affiliation | Academia | ¹University of Illinois Urbana-Champaign, ²Carnegie Mellon University |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We promise that we will open-source the data and code after paper acceptance. |
| Open Datasets | Yes | We evaluate the performance on two challenging indoor 3D VQA datasets: ScanQA [5] and SQA3D [55]. |
| Dataset Splits | Yes | We conduct the experiments on the ScanNet [20] segmentation dataset, which has 1,201 and 312 scenes for training and validation, respectively, with a total of 20 semantic classes for evaluation. |
| Hardware Specification | Yes | All analyses and computations are performed on an NVIDIA A100 GPU. |
| Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We finetune a Q-Former module [48] to align features from different encoders to the LLM input space. More dataset and optimization details are provided in the supplementary material. *(A sketch of this alignment step follows the table.)* |
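
To make the alignment step concrete, below is a minimal PyTorch sketch of a Q-Former-style module in the spirit of BLIP-2 [48]: a small set of learnable query tokens cross-attends to frozen encoder features and emits a fixed number of LLM-ready tokens. All names and hyperparameters here (`QFormerAligner`, `feat_dim=1024`, `llm_dim=4096`, 32 queries, 2 layers) are illustrative assumptions, not the paper's actual configuration, which is described in its supplementary material.

```python
import torch
import torch.nn as nn

class QFormerAligner(nn.Module):
    """Hypothetical Q-Former-style alignment head: learnable queries
    cross-attend to frozen visual-encoder features and produce a fixed
    number of tokens in the LLM's input width."""

    def __init__(self, feat_dim=1024, llm_dim=4096,
                 num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens that gather information from the scene features.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # Project encoder features into the query / LLM width.
        self.in_proj = nn.Linear(feat_dim, llm_dim)
        # Each decoder layer = self-attention over queries,
        # cross-attention to the encoder features, then a feed-forward block.
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=llm_dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(num_layers)
        ])

    def forward(self, feats):
        # feats: (batch, num_patches_or_points, feat_dim) from a FROZEN encoder.
        memory = self.in_proj(feats)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        for layer in self.layers:
            q = layer(q, memory)  # queries cross-attend to scene features
        return q                  # (batch, num_queries, llm_dim) LLM-ready tokens

# Usage: align frozen encoder features to the LLM token space.
aligner = QFormerAligner(feat_dim=1024, llm_dim=4096)
scene_feats = torch.randn(2, 2048, 1024)  # e.g., per-point features from a 3D encoder
llm_tokens = aligner(scene_feats)         # shape: (2, 32, 4096)
```

In a BLIP-2-style setup, these output tokens would be prepended to the language model's input embeddings, and only the aligner would be finetuned while the visual encoder and LLM stay frozen; this matches the row above, which states that only the Q-Former is finetuned per encoder.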