Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

Authors: Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liangyan Gui, Yu-Xiong Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address this issue, we present the first comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image, video, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding.
Researcher Affiliation | Academia | 1 University of Illinois Urbana-Champaign; 2 Carnegie Mellon University
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We promise that we will open-source the data and code after paper acceptance.
Open Datasets | Yes | We evaluate the performance on two challenging indoor 3D VQA datasets: ScanQA [5] and SQA3D [55].
Dataset Splits | Yes | We conduct the experiments on the ScanNet [20] segmentation dataset which has 1,201 and 312 scenes for training and validation, respectively, with a total of 20 semantic classes for evaluation.
Hardware Specification | Yes | All analyses and computations are performed on an NVIDIA A100 GPU.
Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | We finetune a Q-Former module [48] to align features from different encoders to the LLM input space. More dataset and optimization details are provided in the supplementary material.
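The Experiment Setup row quotes the paper's use of a Q-Former module [48] to align frozen visual-encoder features with the LLM input space. The sketch below is an illustration only, not the authors' implementation: it shows one common way such an alignment module is structured, with a small set of learnable query tokens that cross-attend to the frozen encoder's features and are then projected to the LLM's token-embedding dimension. The class name (QFormerAligner) and all dimensions and layer counts (enc_dim, hidden_dim, llm_dim, num_queries, num_layers) are assumptions for illustration, not values reported in the paper.

```python
# Minimal sketch (assumed structure, not the paper's code) of a Q-Former-style
# alignment module: learnable queries cross-attend to frozen visual-encoder
# features and are projected into the LLM's input embedding space.
import torch
import torch.nn as nn


class QFormerAligner(nn.Module):
    def __init__(self, enc_dim=1024, hidden_dim=768, llm_dim=4096,
                 num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens that summarize the (frozen) encoder features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim) * 0.02)
        # Project encoder features into the Q-Former hidden size.
        self.enc_proj = nn.Linear(enc_dim, hidden_dim)
        # Each layer: cross-attention from queries to features plus a feed-forward block.
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "cross_attn": nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True),
                "norm1": nn.LayerNorm(hidden_dim),
                "ffn": nn.Sequential(
                    nn.Linear(hidden_dim, 4 * hidden_dim),
                    nn.GELU(),
                    nn.Linear(4 * hidden_dim, hidden_dim),
                ),
                "norm2": nn.LayerNorm(hidden_dim),
            })
            for _ in range(num_layers)
        ])
        # Final projection into the LLM token-embedding dimension.
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, enc_feats):
        # enc_feats: (batch, num_tokens, enc_dim) features from a frozen visual encoder.
        kv = self.enc_proj(enc_feats)
        q = self.queries.expand(enc_feats.size(0), -1, -1)
        for layer in self.layers:
            attn_out, _ = layer["cross_attn"](q, kv, kv)
            q = layer["norm1"](q + attn_out)
            q = layer["norm2"](q + layer["ffn"](q))
        # (batch, num_queries, llm_dim): usable as soft prompt tokens for the LLM input.
        return self.to_llm(q)


if __name__ == "__main__":
    feats = torch.randn(2, 196, 1024)   # e.g. patch features from a frozen encoder
    aligner = QFormerAligner()
    print(aligner(feats).shape)          # torch.Size([2, 32, 4096])
```

In a probing setup like the one the paper describes, the visual encoders stay frozen and only a lightweight alignment module of this kind (together with the task-specific head) is trained, so the measured performance reflects the quality of the frozen features rather than end-to-end finetuning.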