Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

Authors: Insu Lee, Wooje Park, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate M3CoT on the E3VQA benchmark using state-of-the-art LVLMs, including GPT-4o [13] and Gemini 2.0 Flash [37], and observe a considerable gain in accuracy of 4.84% and 5.94% over a recent CoT baseline. Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs. Table 1 reports model accuracy across categories.
Researcher Affiliation | Academia | ¹Seoul National University, ²Sungkyunkwan University
Pseudocode | No | The paper describes the M3CoT method and E3VQA construction pipeline using descriptive text and flowcharts (Figure 3, Figure 4) but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The dataset and source code are available at https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding.
Open Datasets | Yes | We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. The dataset and source code are available at https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding. The E3VQA benchmark is constructed utilizing the large-scale synchronized ego-exo dataset, Ego-Exo4D [12].
Dataset Splits | Yes | Each category contains 1,000 question-answer (QA) pairs, evenly divided between egocentric and exocentric questions (e.g., "What am I doing?" vs. "What is the person doing?"), which supports the evaluation of the model's generalization capability to diverse forms of user queries. Note that all video clips are selected from the test split to prevent any potential dataset contamination.
Hardware Specification | Yes | For evaluation, we use NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper uses off-the-shelf LVLMs (GPT-4o, Gemini 2.0 Flash) and mentions utilizing GPT-4o for automated QA generation, but does not specify versioned software dependencies like Python, PyTorch, or CUDA versions for its own methodology.
Experiment Setup | Yes | We evaluate five closed-source and nine open-source LVLMs on the E3VQA benchmark using their default configurations. Detailed model specifications, experimental settings, as well as system and user prompt templates are provided in Appendices B.1 and E.2. All evaluation results are reported as the mean and standard deviation over three independent runs, using each model's default generation settings.
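
The Experiment Setup entry states that results are reported as the mean and standard deviation over three independent runs. The following is a minimal Python sketch of that aggregation, not the authors' evaluation code. The function name summarize_runs and the accuracy values are hypothetical, and the choice of the sample standard deviation (n - 1 in the denominator) is an assumption, since the quoted text does not say which estimator the paper uses.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the source does not specify sample vs. population.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(accuracies), stdev(accuracies)


# Hypothetical accuracies (%) from three independent runs; not values from the paper.
runs = [60.0, 61.0, 62.0]
avg, sd = summarize_runs(runs)
print(f"{avg:.2f} ± {sd:.2f}")
```

If the population standard deviation is intended instead, statistics.pstdev can be substituted for stdev; with only three runs the two estimators differ noticeably, so it is worth checking which one a reported figure uses.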