Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

FlexSelect: Flexible Token Selection for Efficient Long Video Understanding

Authors: Yunzhu Zhang, Yu Lu, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, FlexSelect delivers strong gains across multiple long-video benchmarks including VideoMME, MLVU, LongVB, and LVBench. Moreover, it achieves significant speed-ups (e.g., up to 9× on a LLaVA-Video-7B model), highlighting FlexSelect's promise for efficient long-form video understanding.
Researcher Affiliation	Collaboration	Yunzhu Zhang¹, Yu Lu¹, Tianyi Wang³, Fengyun Rao³, Yi Yang¹,², Linchao Zhu¹,²; ¹The College of Computer Science and Technology, Zhejiang University; ²The State Key Lab of Brain-Machine Intelligence, Zhejiang University; ³WeChat Vision, Tencent Inc.
Pseudocode	No	The paper describes the method using textual explanations and diagrams (Figure 3 and Figure 4) but does not include explicit pseudocode or algorithm blocks.
Open Source Code	Yes	Project page: https://yunzhuzhang0918.github.io/flex_select. [...] We have released our code and trained model weights.
Open Datasets	Yes	We evaluate it on four challenging long-video understanding benchmarks: VideoMME, MLVU, LongVB, and LVBench [...] (1) LongVideoBench [41], (2) MLVU [54], (3) VideoMME [12], and (4) LVBench [38].
Dataset Splits	Yes	We randomly select a small (~5%) subset of LLaVA-Video-178K [53] as the training data, which contains about 67k video instruction samples. [...] We randomly sample 128 videos from the VideoMME [12] test set and insert each needle-query pair into them, resulting in a total of 640 test samples.
Hardware Specification	Yes	The evaluations in main Table 1 are conducted under LMMS-Eval [48] framework on 8× 96G H20 GPUs.
Software Dependencies	No	The paper mentions using models like LLaVA-Video, InternVL, Qwen-VL, and the LMMS-Eval framework, but does not provide specific version numbers for programming languages or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup	Yes	We set the input frames number N to 1024, 512, 512 for Qwen2.5VL (7B/72B), LLaVA-Video (7B/72B), and InternVL2.5-8B respectively, and max subset frames number S to 64 for all models. [...] Token selectors for LLaVA-Video and InternVL2.5 were trained for 1 epoch, while the token selector for Qwen2.5VL was trained for 3 epochs.