Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

Authors: Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liangyan Gui, Yu-Xiong Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address this issue, we present the first comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image, video, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding.
Researcher Affiliation | Academia | 1 University of Illinois Urbana-Champaign; 2 Carnegie Mellon University
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We promise that we will open-source the data and code after paper acceptance.
Open Datasets | Yes | We evaluate the performance on two challenging indoor 3D VQA datasets: ScanQA [5] and SQA3D [55].
Dataset Splits | Yes | We conduct the experiments on the ScanNet [20] segmentation dataset which has 1,201 and 312 scenes for training and validation, respectively, with a total of 20 semantic classes for evaluation.
Hardware Specification | Yes | All analyses and computations are performed on an NVIDIA A100 GPU.
Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | We finetune a Q-Former module [48] to align features from different encoders to the LLM input space. More dataset and optimization details are provided in the supplementary material.
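The Experiment Setup row quotes the paper's use of a Q-Former module [48] to align frozen visual-encoder features with the LLM input space. The sketch below is an illustration only, not the authors' implementation: it shows one common way such an alignment module is structured, with a small set of learnable query tokens that cross-attend to the frozen encoder's features and are then projected to the LLM's token-embedding dimension. The class name (QFormerAligner) and all dimensions and layer counts (enc_dim, hidden_dim, llm_dim, num_queries, num_layers) are assumptions for illustration, not values reported in the paper.

```python
# Minimal sketch (assumed structure, not the paper's code) of a Q-Former-style
# alignment module: learnable queries cross-attend to frozen visual-encoder
# features and are projected into the LLM's input embedding space.
import torch
import torch.nn as nn


class QFormerAligner(nn.Module):
    def __init__(self, enc_dim=1024, hidden_dim=768, llm_dim=4096,
                 num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens that summarize the (frozen) encoder features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim) * 0.02)
        # Project encoder features into the Q-Former hidden size.
        self.enc_proj = nn.Linear(enc_dim, hidden_dim)
        # Each layer: cross-attention from queries to features plus a feed-forward block.
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "cross_attn": nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True),
                "norm1": nn.LayerNorm(hidden_dim),
                "ffn": nn.Sequential(
                    nn.Linear(hidden_dim, 4 * hidden_dim),
                    nn.GELU(),
                    nn.Linear(4 * hidden_dim, hidden_dim),
                ),
                "norm2": nn.LayerNorm(hidden_dim),
            })
            for _ in range(num_layers)
        ])
        # Final projection into the LLM token-embedding dimension.
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, enc_feats):
        # enc_feats: (batch, num_tokens, enc_dim) features from a frozen visual encoder.
        kv = self.enc_proj(enc_feats)
        q = self.queries.expand(enc_feats.size(0), -1, -1)
        for layer in self.layers:
            attn_out, _ = layer["cross_attn"](q, kv, kv)
            q = layer["norm1"](q + attn_out)
            q = layer["norm2"](q + layer["ffn"](q))
        # (batch, num_queries, llm_dim): usable as soft prompt tokens for the LLM input.
        return self.to_llm(q)


if __name__ == "__main__":
    feats = torch.randn(2, 196, 1024)   # e.g. patch features from a frozen encoder
    aligner = QFormerAligner()
    print(aligner(feats).shape)          # torch.Size([2, 32, 4096])
```

In a probing setup like the one the paper describes, the visual encoders stay frozen and only a lightweight alignment module of this kind (together with the task-specific head) is trained, so the measured performance reflects the quality of the frozen features rather than end-to-end finetuning.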