Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
Authors: Evangelos Dorovatas, Soroush Seifi, Gunshi Gupta, Rahaf Aljundi
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on three offline and streaming benchmarks, achieving state-of-the-art performance with significantly lower memory requirements. Our contributions are as follows: We propose Recurrent LLM-informed Visual Selection (rLiVS), a simple, training-free approach for long video understanding and question answering. Our approach is agnostic to the Video-LLM architecture and does not require any external modules. We achieve state-of-the-art performance with significantly lower memory requirements. In the following, we discuss the closely related work in Section 2 and present our method in Section 3. We evaluate our method and ablate the design choices in Section 4 and conclude in Section 5. |
| Researcher Affiliation | Collaboration | Vaggelis Dorovatas¹,², Soroush Seifi¹, Gunshi Gupta³, Rahaf Aljundi¹; ¹Toyota Motor Europe, ²Archimedes RU, Athena RC, ³University of Oxford |
| Pseudocode | Yes | 3.3 Efficient Video Question Answering. Algorithm 1: Streaming Video Processing and Query Answering with rLiVS. Streaming Video Processing: 1: Ml ← [ ], Ms ← queue(), B ← [ ] 2: MAX_MEM ← 16, CLIP_SIZE ← 16 3: while frames available do 4: B.append(get_next_frame()) 5: if length(B) == CLIP_SIZE then 6: context ← Ms + B 7: B.clear() 8: S, C ← Attn_Selection(context) 9: if length(Ms) == MAX_MEM then 10: Ms.pop_left() 11: end if 12: Ms.append(S) 13: Ml.append(C) 14: end if 15: end while. Query Answering: 16: Q ← embed(query) 17: C ← Retrieve_TopK(Q, Ml) 18: context ← C + Q 19: answer ← LLM_Generate_Answer(context) |
| Open Source Code | No | Answer: [No] Justification: We will release the code after finalizing the patent. |
| Open Datasets | Yes | We evaluate our method's effectiveness in online scenarios using the Realtime VStream-QA benchmark [38], which includes RVS-Movie (emphasizing plot understanding) and RVS-Ego (focusing on visual comprehension), both featuring 40-minute videos with diverse open-ended questions. To demonstrate robustness, we also report results on offline benchmarks: MovieChat [28] (170 videos averaging 576 seconds across various genres with 510 questions testing long-range comprehension), offline VStream-QA (VS-Movie and VS-Ego), and CG-Bench [5] (1,219 videos averaging 27 minutes with 12K multiple-choice questions). Additionally, we conduct an ablation on NextQA-valset [36] (570 shorter videos averaging 44 seconds with 5K multiple-choice questions) to validate our attention-based visual token selection approach. |
| Dataset Splits | Yes | Additionally, we conduct an ablation on NextQA-valset [36] (570 shorter videos averaging 44 seconds with 5K multiple-choice questions) to validate our attention-based visual token selection approach. ... Using the MLVU validation split [39], we compute the holistic video summarization score and compare our approach to Video-XL [27], a recent training-based streaming video understanding model that also employs recurrent mechanisms. |
| Hardware Specification | Yes | Benchmarkings are conducted on a single A100 GPU. ... Resources: Across all experiments, we utilize one or two NVIDIA A100 GPUs with 40GB of memory. Latency and VRAM measurements are conducted on a single GPU. |
| Software Dependencies | No | Modern fast attention implementations (e.g., FlashAttention-2 [8]) avoid explicitly materializing the full N² attention matrix (with N as the sequence length) to reduce memory and computational cost. ... For benchmarks with open-ended questions, we report both Accuracy and a Score based on GPT-3.5 evaluation, following prior works (we use gpt-3.5-turbo-0613, consistent with the evaluation setup of ReKV [9]). |
| Experiment Setup | Yes | Implementation Details. We implement our method on LLaVA-OneVision [17], a strong VLM for image and video tasks, enabling direct comparison with ReKV (current state-of-the-art on RVS benchmarks). We demonstrate versatility by evaluating both 7B and 0.5B variants, with 7B used unless specified otherwise. Since LLaVA-OV is trained with 32 frames (196 visual tokens each), we allocate 16 frames for current short clips and 16 for short-term recurrent memory. We select only 196 visual tokens from 3,136 available (16 × 196), retaining just 6.25% of total visual information per short clip. Following previous works, we process RVS-Movie and RVS-Ego at 0.5 FPS [38, 9], MovieChat at 1 FPS, and CG-Bench and offline VS-Stream at 0.5 FPS, with 10K context tokens for retrieval and generation. We average attention scores from 4 (of 28) backbone layers across experiments. |
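
The control flow of Algorithm 1 quoted in the Pseudocode row can be sketched in Python as below. This is an illustrative sketch only, not the authors' code (which the Open Source Code row reports as unreleased). The functions `attn_selection` and `retrieve_top_k` are hypothetical stand-ins: the real method scores tokens with attention from the Video-LLM, whereas this sketch uses a simple similarity-to-mean score, and it assumes the long-term memory entry `C` can be represented as a single vector per clip so that cosine-similarity retrieval is possible.

```python
from collections import deque
import numpy as np


def attn_selection(context, n_clip_tokens, keep):
    """Hypothetical placeholder for Attn_Selection.

    Assumption: similarity to the mean context token stands in for the
    LLM attention scores. Only the current clip's tokens (the last
    n_clip_tokens rows of context) are candidates for selection.
    """
    scores = context @ context.mean(axis=0)
    clip = context[-n_clip_tokens:]
    clip_scores = scores[-n_clip_tokens:]
    top = np.sort(np.argsort(clip_scores)[-keep:])  # preserve temporal order
    selected = clip[top]
    summary = selected.mean(axis=0)  # assumed form of the long-term entry C
    return selected, summary


def process_stream(frames, clip_size=16, max_mem=16, keep=196):
    """Streaming loop of Algorithm 1 (lines 1-15)."""
    long_mem = []        # Ml: one entry per processed clip
    short_mem = deque()  # Ms: selected tokens of the most recent clips
    buffer = []          # B: frames of the clip being assembled
    for frame in frames:  # each frame: array of shape (tokens, dim)
        buffer.append(frame)
        if len(buffer) == clip_size:
            n_clip_tokens = sum(f.shape[0] for f in buffer)
            context = np.concatenate(list(short_mem) + buffer, axis=0)
            buffer.clear()
            selected, summary = attn_selection(context, n_clip_tokens, keep)
            if len(short_mem) == max_mem:
                short_mem.popleft()  # drop the oldest clip's selection
            short_mem.append(selected)
            long_mem.append(summary)
    return long_mem, short_mem


def retrieve_top_k(query, long_mem, k):
    """Hypothetical Retrieve_TopK: cosine similarity against long-term memory."""
    mem = np.stack(long_mem)
    norms = np.linalg.norm(mem, axis=1) * np.linalg.norm(query) + 1e-8
    sims = (mem @ query) / norms
    top = np.sort(np.argsort(sims)[-k:])
    return [long_mem[i] for i in top]
```

With the defaults reported in the Experiment Setup row, each clip contributes 16 × 196 = 3,136 visual tokens, of which 196 are kept (196 / 3,136 = 6.25%). A full short-term memory of 16 selections therefore occupies the same token budget as 16 frames, so memory plus current clip matches the 32-frame context the backbone was trained with.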