Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
Authors: Xiaoqian Shen, Wenxuan Zhang, Jun Chen, Mohamed Elhoseiny
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0%–5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6%. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent. Section 4: Experiments. Experimental results demonstrate that our framework consistently improves the performance of existing LVLMs by 3.0%–5.4%. |
| Researcher Affiliation | Collaboration | Xiaoqian Shen¹, Wenxuan Zhang¹, Jun Chen¹,², Mohamed Elhoseiny¹. ¹King Abdullah University of Science and Technology, ²Meta AI |
| Pseudocode | No | The paper describes the methodology in Section 3 'Methodology' using prose and pipeline diagrams (e.g., Figure 2). It does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Our code is publicly available at https://xiaoqian-shen.github.io/Vgent. |
| Open Datasets | Yes | We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0%–5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6%. Benchmarks. We evaluate the performances of each model across three long-video benchmarks. Video-MME [13] is a widely used benchmark... MLVU [57] is a long-video understanding benchmark... LongVideoBench (LVB) [50] focuses on referred reasoning tasks... |
| Dataset Splits | Yes | Benchmarks. We evaluate the performances of each model across three long-video benchmarks. Video-MME [13]... MLVU [57]... LongVideoBench (LVB) [50]... For MLVU [57], we extract spoken content using openai/whisper-large, while for Video-MME [13] and LongVideoBench [50], we use the provided subtitles from benchmark. |
| Hardware Specification | Yes | All experiments are conducted on A100 80G GPUs. |
| Software Dependencies | No | The paper mentions tools like 'BAAI/bge-large-en-v1.5' for embeddings and 'openai/whisper-large' for spoken content extraction. However, it does not specify software dependencies with explicit version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | During the offline video graph construction, we sample videos at 1.0 FPS, segmenting the long video into clips, each containing K = 64 frames. We use the BAAI/bge-large-en-v1.5 [51] embedding for similarity calculation. The entity merging threshold is set to τ = 0.7. In the online retrieval stage, we use BAAI/bge-large-en-v1.5 to retrieve the top N = 20 clips based on extracted keywords (maximum to 20 to discard low-relevance, with a similarity threshold θ = 0.5). After structured query refinement, we retain a maximum of r = 5 clips. Thresholds are set as the same for all three benchmarks, with hyper-parameter selection details provided in the supplementary. |
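
The hyper-parameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The example below is illustrative only and is not taken from the authors' code: the names `VgentSetup`, `segment_into_clips`, and `retrieve_clips` are hypothetical, similarity scores are assumed to be precomputed (for example, from BAAI/bge-large-en-v1.5 embeddings), the threshold comparison is assumed to be inclusive, and keeping a shorter final clip is an assumption about how leftover frames are handled.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class VgentSetup:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    sample_fps: float = 1.0       # frames sampled per second of video
    frames_per_clip: int = 64     # K, frames per clip
    merge_threshold: float = 0.7  # tau, entity merging threshold
    top_n: int = 20               # N, maximum number of retrieved clips
    sim_threshold: float = 0.5    # theta, retrieval similarity threshold
    max_refined: int = 5          # r, clips retained after query refinement


def segment_into_clips(duration_s: float, cfg: VgentSetup) -> List[Tuple[int, int]]:
    """Return half-open [start, end) ranges over sampled frames, one per clip.

    Assumption: a final clip shorter than K frames is kept rather than dropped.
    """
    num_frames = int(duration_s * cfg.sample_fps)
    k = cfg.frames_per_clip
    return [(s, min(s + k, num_frames)) for s in range(0, num_frames, k)]


def retrieve_clips(similarities: Sequence[float], cfg: VgentSetup) -> List[int]:
    """Return clip indices with similarity >= theta, best first, capped at N.

    Assumption: one precomputed similarity score per clip; the inclusive
    comparison (>=) is a guess, since the source does not specify it.
    """
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    kept = [i for i in ranked if similarities[i] >= cfg.sim_threshold]
    return kept[: cfg.top_n]
```

Under these assumptions, a 200-second video sampled at 1.0 FPS yields four clips (three full clips of 64 frames and one of 8), and clips scoring below θ = 0.5 are discarded before the cap of N = 20 applies. The structured query refinement step is not sketched here because the row does not describe it; only its cap of r = 5 clips is recorded in the configuration.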