Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Authors: Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liangyan Gui, Yu-Xiong Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this issue, we present the first comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image, video, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. |
| Researcher Affiliation | Academia | 1 University of Illinois Urbana-Champaign 2 Carnegie Mellon University |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We promise that we will open-source the data and code after paper acceptance. |
| Open Datasets | Yes | We evaluate the performance on two challenging indoor 3D VQA datasets: ScanQA [5] and SQA3D [55]. |
| Dataset Splits | Yes | We conduct the experiments on the ScanNet [20] segmentation dataset which has 1,201 and 312 scenes for training and validation, respectively, with a total of 20 semantic classes for evaluation. |
| Hardware Specification | Yes | All analyses and computations are performed on an NVIDIA A100 GPU. |
| Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We finetune a Q-Former module [48] to align features from different encoders to the LLM input space. More dataset and optimization details are provided in the supplementary material. |