Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

Authors: Yongdong Luo, Xiawu Zheng, Guilin Li, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, Rongrong Ji

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Video-RAG across several long video benchmarks, including Video-MME [6], MLVU [54], and LongVideoBench [43]. By applying the Video-RAG to seven distinctive open-source LVLMs, we achieve an average performance improvement of 2.8% on Video-MME with only 2.0K text tokens addition (equal to 14 frames in most configuration) per case, while beating the proprietary LVLM when integrated with the 72B model, as shown in the right part of Figure 1.
Researcher Affiliation | Academia | Yongdong Luo1 Xiawu Zheng1 Guilin Li1 Shukang Yin Haojia Lin1 Chaoyou Fu2 Jinfa Huang3 Jiayi Ji1 Fei Chao1 Jiebo Luo3 1Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China 2 Nanjing University 3 University of Rochester
Pseudocode | No | The paper describes its methodology through text and a framework diagram (Figure 2), but does not include explicit pseudocode or an algorithm block.
Open Source Code | Yes | Codes are available at https://github.com/Leon1207/Video-RAG-master.
Open Datasets | Yes | Video-MME [6] is a widely used benchmark for assessing the ability of LVLMs to handle detailed videos in real-world scenarios. It is divided into three subsets based on video length, with durations ranging from 11 seconds to 1 hour. MLVU [54] is a long video understanding benchmark with a large wide of 9 distinct tasks. It is created based on long videos of diversified lengths, ranging from 3 minutes to 2 hours with about 12 minutes average video length. LongVideoBench [43] is a benchmark designed to accurately retrieve and reason over detailed multimodal information from long videos, with 6,678 human-annotated multiple-choice questions in 17 fine-grained categories.
Dataset Splits | Yes | We selected 10% of the full dataset, comprising 30 short, 30 medium-length, and 30 long videos.
Hardware Specification | Yes | We performed all experiments on NVIDIA A100 80G GPUs.
Software Dependencies | No | The paper mentions software tools like EasyOCR [10], Contriever [9], FAISS [13], Whisper [31], and APE [36] but does not provide specific version numbers for these components.
Experiment Setup | Yes | In the auxiliary text retrieval phase, we set both the CLIP similarity threshold and the FAISS similarity threshold t to 0.3. We employ the IndexFlatIP as the similarity calculating method of FAISS [13].
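
The Experiment Setup entry mentions an inner-product similarity search with a threshold t of 0.3. The following is a minimal pure-Python sketch of what thresholded inner-product retrieval could look like; it is not the authors' implementation and does not call FAISS. The function names, the L2 normalization step, and the choice of keeping scores greater than or equal to the threshold are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (assumed preprocessing, not stated in the source)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, the score an exhaustive inner-product index compares."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_above_threshold(
    query: Sequence[float],
    corpus: Sequence[Sequence[float]],
    threshold: float = 0.3,  # value of t reported in the table above
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs whose score meets the threshold, best first.

    Hypothetical helper: the ">=" comparison is an assumption; the source
    only states that the threshold is set to 0.3.
    """
    q = l2_normalize(query)
    hits = []
    for idx, vec in enumerate(corpus):
        score = inner_product(q, l2_normalize(vec))
        if score >= threshold:
            hits.append((idx, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Toy embeddings standing in for auxiliary-text vectors.
example_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
example_hits = retrieve_above_threshold([1.0, 0.0], example_corpus)
```

With unit-length vectors the inner product equals cosine similarity, which is why a fixed cutoff such as 0.3 is meaningful across queries in this sketch; whether the paper normalizes its embeddings is not stated in the table.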