Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

Authors: Xiaoqian Shen, Wenxuan Zhang, Jun Chen, Mohamed Elhoseiny

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0% to 5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6%. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent. Section 4: Experiments. Experimental results demonstrate that our framework consistently improves the performance of existing LVLMs by 3.0% to 5.4%.
Researcher Affiliation	Collaboration	Xiaoqian Shen¹, Wenxuan Zhang¹, Jun Chen¹,², Mohamed Elhoseiny¹; ¹King Abdullah University of Science and Technology, ²Meta AI
Pseudocode	No	The paper describes the methodology in Section 3 'Methodology' using prose and pipeline diagrams (e.g., Figure 2). It does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code	Yes	Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.
Open Datasets	Yes	We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0% to 5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6%. Benchmarks. We evaluate the performances of each model across three long-video benchmarks. Video-MME [13] is a widely used benchmark... MLVU [57] is a long-video understanding benchmark... Long Video Bench (LVB) [50] focuses on referred reasoning tasks...
Dataset Splits	Yes	Benchmarks. We evaluate the performances of each model across three long-video benchmarks. Video-MME [13]... MLVU [57]... Long Video Bench (LVB) [50]... For MLVU [57], we extract spoken content using openai/whisper-large, while for Video MME [13] and Long Video Bench [50], we use the provided subtitles from benchmark.
Hardware Specification	Yes	All experiments are conducted on A100 80G GPUs.
Software Dependencies	No	The paper mentions tools like 'BAAI/bge-large-en-v1.5' for embeddings and 'openai/whisper-large' for spoken content extraction. However, it does not specify software dependencies with explicit version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup	Yes	During the offline video graph construction, we sample videos at 1.0 FPS, segmenting the long video into clips, each containing K = 64 frames. We use the BAAI/bge-large-en-v1.5 [51] embedding for similarity calculation. The entity merging threshold is set to τ = 0.7. In the online retrieval stage, we use BAAI/bge-large-en-v1.5 to retrieve the top N = 20 clips based on extracted keywords (maximum to 20 to discard low-relevance, with a similarity threshold θ = 0.5). After structured query refinement, we retain a maximum of r = 5 clips. Thresholds are set as the same for all three benchmarks, with hyper-parameter selection details provided in the supplementary.
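
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not the authors' code: the names `VgentSetup`, `num_clips`, and `retrieve` are hypothetical, and two details are assumptions made for illustration only, namely that a trailing partial clip is kept and that the similarity threshold is inclusive (>= θ). Similarity scores are taken as given; computing them with BAAI/bge-large-en-v1.5 is outside the sketch.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VgentSetup:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    fps: float = 1.0               # frame sampling rate for graph construction
    frames_per_clip: int = 64      # K
    merge_threshold: float = 0.7   # tau, entity merging
    top_n: int = 20                # N, clips retrieved from extracted keywords
    sim_threshold: float = 0.5     # theta, low-relevance cutoff
    max_refined: int = 5           # r, clips kept after structured query refinement


def num_clips(duration_s: float, cfg: VgentSetup) -> int:
    """Clip count for a video; assumes a trailing partial clip is kept."""
    frames = int(duration_s * cfg.fps)
    return math.ceil(frames / cfg.frames_per_clip)


def retrieve(similarities: List[float], cfg: VgentSetup) -> List[int]:
    """Indices of at most top_n clips with similarity >= theta, best first.

    Assumes an inclusive threshold; ties are broken by clip index.
    """
    kept = [(s, i) for i, s in enumerate(similarities) if s >= cfg.sim_threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in kept[: cfg.top_n]]
```

Under these assumptions, a one-hour video sampled at 1.0 FPS yields 3600 frames, and `num_clips(3600, VgentSetup())` splits them into clips of 64 frames with one shorter final clip.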