Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

VideoLucy: Deep Memory Backtracking for Long Video Understanding

Authors: Jialong Zuo, Yongtai Deng, Lingdong Kong, Jingkang Yang, Rui Jin, Yiwei Zhang, Nong Sang, Liang Pan, Ziwei Liu, Changxin Gao

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o.
Researcher Affiliation	Collaboration	1 National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; 2 NUS; 3 S-Lab, NTU; 4 Shanghai AI Lab
Pseudocode	Yes	Algorithm 1 The Iterative Backtracking Mechanism
Open Source Code	No	"Our code and dataset will be made publicly available." and "To ensure the proper use of the technology, we would not provide open access to our data and code during the paper submission process."
Open Datasets	Yes	In addition, we introduce EgoMem, a new benchmark for long video understanding. Built on EgoLife [51], EgoMem comprehensively assesses a model's temporal understanding and fine-grained detail perception in extremely long videos. Following common practices, we conduct experiments mainly on three existing long video benchmarks. MLVU [67] is a comprehensive benchmark... Video-MME [11] contains 2,700 manually annotated questions... LVBench [44] is designed for ultra-long video understanding...
Dataset Splits	Yes	MLVU. Since MLVU is a comprehensive video understanding benchmark with a wide range of video durations, we divided it by video length: 0-600s (short), 600-1200s (medium), 1200-3600s (long), and >3600s (extra-long). Video-MME. This benchmark has been divided into short, medium, and long splits by the original authors. Following convention, we adopt the official default splits. LVBench. It is a benchmark specifically designed for long video understanding, yet it still encompasses a relatively wide range of video durations. We also divide it by video length: 1800-3600s (short), 3600-5400s (medium), and >5400s (long).
Hardware Specification	Yes	In our implementation, we deployed Qwen2.5-VL-7B locally on 8 A100 GPUs using vLLM, achieving efficient batch processing that generates textual descriptions for one-hour videos within dozens of seconds.
Software Dependencies	No	"In our implementation, we deployed Qwen2.5-VL-7B locally on 8 A100 GPUs using vLLM, achieving efficient batch processing that generates textual descriptions for one-hour videos within dozens of seconds." and "we consistently use the open-sourced models Qwen-2.5-VL-7B [2] and DeepSeek-R1 [15]..."
Experiment Setup	Yes	The temporal scopes T_c, T_f, T_uf have distinct settings for different video benchmarks. For each video, the temporal scopes are set as follows: coarse-grained memory (T_c = 800s), fine-grained memory (T_f = 80s), and ultra-fine-grained memory (T_uf = 8s). The frame sampling rates are respectively set to 0.25 FPS, 0.5 FPS, and 1 FPS for the three memory types. ... the best performance is achieved when the number of iterations is set to 5, which is set as our default value.
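
The duration-based splits and memory settings quoted in the table can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the function names are hypothetical, the upper bound of each duration range is assumed to be inclusive (the excerpt does not say which side of a boundary a video falls on), and memory segments are assumed to tile the video without overlap.

```python
import math

# Split boundaries in seconds, as quoted in the Dataset Splits row.
# Assumption: each upper bound is inclusive; the source does not specify this.
SPLITS = {
    "MLVU": (0, [(600, "short"), (1200, "medium"),
                 (3600, "long"), (math.inf, "extra-long")]),
    "LVBench": (1800, [(3600, "short"), (5400, "medium"),
                       (math.inf, "long")]),
}

# (temporal scope in seconds, sampling rate in FPS) per memory type,
# as quoted in the Experiment Setup row.
MEMORY_LEVELS = {
    "coarse": (800, 0.25),
    "fine": (80, 0.5),
    "ultra-fine": (8, 1.0),
}


def split_for(benchmark: str, duration_s: float) -> str:
    """Return the duration split a video falls into (hypothetical helper)."""
    lower, bounds = SPLITS[benchmark]
    if duration_s < lower:
        raise ValueError(f"{duration_s}s is below the {benchmark} range")
    for upper, label in bounds:
        if duration_s <= upper:
            return label
    raise ValueError("unreachable: the last bound is infinite")


def frames_per_segment(level: str) -> int:
    """Frames sampled within one memory segment: scope times FPS."""
    scope_s, fps = MEMORY_LEVELS[level]
    return int(scope_s * fps)


def segments_per_video(level: str, duration_s: float) -> int:
    """Segments needed to cover a video, assuming non-overlapping tiling."""
    scope_s, _ = MEMORY_LEVELS[level]
    return math.ceil(duration_s / scope_s)


if __name__ == "__main__":
    for level in MEMORY_LEVELS:
        print(level, frames_per_segment(level), segments_per_video(level, 3600))
```

Under these assumptions, a coarse-grained segment covers 800 s x 0.25 FPS = 200 frames, a fine-grained segment 40 frames, and an ultra-fine-grained segment 8 frames. A one-hour video would need 5 coarse, 45 fine, or 450 ultra-fine segments for full coverage.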