Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding

Authors: Xue Zhucun, Jiangning Zhang, Xie Xurong, Yuxuan Cai, Yong Liu, Xiangtai Li, Dacheng Tao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the effectiveness of the proposed AdaVideoRAG framework, we officially release HiVU, the first open benchmark dataset for full-stack capability evaluation in video understanding. Extensive experiments show that our framework enhances the overall efficiency and accuracy of Video-QA for long videos and can be seamlessly integrated with existing MLLMs via lightweight API calls, establishing a new paradigm for adaptive retrieval augmentation in video analysis.
Researcher Affiliation | Collaboration | 1 Zhejiang University; 2 Youtu Lab, Tencent; 3 Huazhong University of Science and Technology; 4 Nanyang Technological University
Pseudocode | No | The paper describes the methodology in prose and with diagrams (e.g., Figure 2) but does not include any structured pseudocode or algorithm blocks with formal steps, loops, or conditional statements.
Open Source Code | Yes | Code: https://github.com/xzc-zju/AdaVideoRAG
Open Datasets | Yes | To demonstrate the effectiveness of the proposed AdaVideoRAG framework, we officially release HiVU, the first open benchmark dataset for full-stack capability evaluation in video understanding. Compared with traditional datasets such as ActivityNet [2] (single action recognition) and MovieQA [37] (open-ended QA), this benchmark achieves, for the first time, cognitive complexity evaluation at different levels, providing a hierarchical evaluation framework for video understanding research. We conduct comprehensive evaluations of the proposed AdaVideoRAG method and the effectiveness of each module primarily on the newly proposed HiVU benchmark (Sec. 2.5), and also introduce public video understanding benchmarks for further thorough assessment. Specifically: 1) HiVU includes over 10 sub-genres across 3 domains, comprising 120 knowledge-rich long-video datasets totaling 60 hours. 2) Video-MME [14] is a full-spectrum multi-modal evaluation benchmark for MLLMs in video analysis. 3) MLVU [53] is a multi-task benchmark for evaluating long-video understanding with diverse genres and extended durations.
Dataset Splits | No | The paper refers to existing benchmarks like 'MLVU_test' and the newly released 'HiVU' benchmark. While these benchmarks imply predefined splits (e.g., a test set for MLVU), the paper itself does not explicitly provide the numerical percentages or sample counts for training, validation, or test splits for any of the datasets used within its main text.
Hardware Specification | Yes | (2) Single-Process Inference. On a single H20 GPU, the average response times are 8 s (Level-1), 26 s (Level-2), and 27 s (Level-3). (3) Parallelization. To further improve deployment efficiency, we introduced multi-process and multi-GPU parallelism. Using dual processes on a single H20 GPU (96 GB, batch size = 2), database construction for Level-2 and Level-3 achieved 2× acceleration, reducing the time to 176 s and 210 s. Scaling to 8 GPUs yielded near-linear speedup (8×), cutting construction time to 22 s and 26 s.
Software Dependencies | No | Specifically, auxiliary text extraction includes three categories: 1) The quantized MiniCPM-V [46] (used as the VLM model) generates fine-grained text descriptions T_C for the sampled frames... 2) Audio is the most direct information carrier in videos... Therefore, we use FastWhisper [33] as the audio extractor to convert the audio in each clip into text format T_A... 3) Characters T_O in each frame are extracted through EASYOCR [22]... Specifically, BGE-M3 [5] extracts entities and relationships from text chunks... ImageBind [16] image encoder (Enc. in Fig. 2) to extract features from key frames... The pre-trained cross-modal semantic alignment encoder ImageBind [16] is employed... a small-scale LLM (Qwen2.5-7B [45, 18] in the paper) for fine-grained filtering... In this paper, we adopt Qwen2.5-7B [18, 45], whose time-consuming proportion is extremely small (averagely 5%) compared to the entire process.
Experiment Setup | Yes | The input long video V is divided into N consecutive and semantically complete clips V = (C_1, C_2, ..., C_N) = {C_n} at fixed time intervals (30 seconds per clip in the paper). For each clip C_n, uniform frame sampling is performed to extract key frames. In this paper, we select 5 frames as the multimodal representation primitive F_n, as more frames do not significantly improve performance but increase computational power and model complexity. By calculating the cosine similarity between text and visual embeddings, candidate segments with similarity scores exceeding a threshold (set to 0.5 in this paper) are filtered out. Using dual processes on a single H20 GPU (96 GB, batch size = 2).
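
The Experiment Setup entry quotes three concrete settings: 30-second clips, 5 uniformly sampled frames per clip, and a cosine-similarity threshold of 0.5. The sketch below is only an illustration of how those settings could fit together, not the authors' implementation. The function names are hypothetical, sampling at sub-interval midpoints is an assumption (the paper says only "uniform"), and the quoted phrase "filtered out" is read here as "selected as candidates", which is also an assumption.

```python
import math

CLIP_SECONDS = 30.0    # clip length quoted from the paper
FRAMES_PER_CLIP = 5    # key frames per clip quoted from the paper
SIM_THRESHOLD = 0.5    # cosine-similarity threshold quoted from the paper


def split_into_clips(duration, clip_seconds=CLIP_SECONDS):
    """Return (start, end) times of consecutive fixed-length clips.

    The last clip is shorter when the duration is not a multiple of clip_seconds.
    """
    n_clips = math.ceil(duration / clip_seconds)
    return [
        (i * clip_seconds, min((i + 1) * clip_seconds, duration))
        for i in range(n_clips)
    ]


def sample_timestamps(start, end, n_frames=FRAMES_PER_CLIP):
    """Uniform sampling: one timestamp at the midpoint of each equal sub-interval.

    Midpoint placement is an assumption made for this sketch.
    """
    step = (end - start) / n_frames
    return [start + (k + 0.5) * step for k in range(n_frames)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_candidates(query_emb, clip_embs, threshold=SIM_THRESHOLD):
    """Indices of clips whose similarity to the query strictly exceeds the threshold."""
    return [
        i for i, emb in enumerate(clip_embs)
        if cosine_similarity(query_emb, emb) > threshold
    ]
```

For a 95-second video, `split_into_clips(95)` yields three full 30-second clips plus one 5-second remainder, and `sample_timestamps(0, 30)` places the five frames 6 seconds apart. In a real pipeline the embeddings would come from the text and visual encoders named in the Software Dependencies entry; plain Python lists stand in for them here.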
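
The timing figures in the Hardware Specification entry can be cross-checked with simple arithmetic. The sketch below takes the reported dual-process times (176 s and 210 s) and speedup factors (2 and 8) as given and derives what they imply. The single-process times it computes are not stated in the quoted text; they are back-calculated under the assumption that the 2-fold factor is exact. The helper names are hypothetical.

```python
# Reported database-construction times with two processes on one H20 GPU.
DUAL_PROCESS_SECONDS = {"Level-2": 176, "Level-3": 210}
DUAL_PROCESS_SPEEDUP = 2  # reported speedup over single-process construction
N_GPUS = 8                # reported multi-GPU configuration


def implied_single_process(seconds, speedup=DUAL_PROCESS_SPEEDUP):
    """Back-calculate the single-process time, assuming the speedup is exact."""
    return seconds * speedup


def ideal_multi_gpu(seconds, n_gpus=N_GPUS):
    """Construction time under perfectly linear scaling across n_gpus."""
    return seconds / n_gpus


for level, seconds in DUAL_PROCESS_SECONDS.items():
    print(level, implied_single_process(seconds), ideal_multi_gpu(seconds))
```

Under perfectly linear scaling the 8-GPU times would be 22 s and 26.25 s, which agrees with the reported 22 s and 26 s and is consistent with the "near-linear" description.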