Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs
Authors: Insu Lee, Wooje Park, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate M3CoT on the E3VQA benchmark using state-of-the-art LVLMs, including GPT-4o [13] and Gemini 2.0 Flash [37], and observe a considerable gain in accuracy of 4.84% and 5.94% over a recent CoT baseline. Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs. Table 1 reports model accuracy across categories. |
| Researcher Affiliation | Academia | Seoul National University; Sungkyunkwan University |
| Pseudocode | No | The paper describes the M3CoT method and E3VQA construction pipeline using descriptive text and flowcharts (Figure 3, Figure 4) but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The dataset and source code are available at https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding. |
| Open Datasets | Yes | We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. The dataset and source code are available at https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding. The E3VQA benchmark is constructed utilizing the large-scale synchronized ego-exo dataset, Ego-Exo4D [12]. |
| Dataset Splits | Yes | Each category contains 1,000 question-answer (QA) pairs, evenly divided between egocentric and exocentric questions (e.g., What am I doing? vs. What is the person doing?), which supports the evaluation of the model's generalization capability to diverse forms of user queries. Note that all video clips are selected from the test split to prevent any potential dataset contamination. |
| Hardware Specification | Yes | For evaluation, we use NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper uses off-the-shelf LVLMs (GPT-4o, Gemini 2.0 Flash) and mentions utilizing GPT-4o for automated QA generation, but does not specify versioned software dependencies like Python, PyTorch, or CUDA versions for its own methodology. |
| Experiment Setup | Yes | We evaluate five closed-source and nine open-source LVLMs on the E3VQA benchmark using their default configurations. Detailed model specifications, experimental settings, as well as system and user prompt templates are provided in Appendices B.1 and E.2. All evaluation results are reported as the mean and standard deviation over three independent runs, using each model's default generation settings. |
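
The experiment setup reports each result as the mean and standard deviation over three independent runs. A minimal sketch of that aggregation is shown below. The function name `summarize_runs` and the accuracy values are hypothetical placeholders, not figures from the paper, and the choice of the sample standard deviation (rather than the population one) is an assumption, since the quoted text does not say which the authors use.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the source does not state which variant the authors report.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(accuracies), stdev(accuracies)


# Hypothetical placeholder accuracies (%) for three runs, NOT results from the paper.
runs = [60.0, 61.0, 62.0]
avg, sd = summarize_runs(runs)
print(f"{avg:.2f} ± {sd:.2f}")
```

If the population standard deviation is intended instead, `statistics.pstdev` can be substituted for `stdev`; with only three runs the two values differ noticeably, so it is worth stating which one a reproduction reports.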