Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
Authors: Yunzhu Zhang, Yu Lu, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, FlexSelect delivers strong gains across multiple long-video benchmarks including VideoMME, MLVU, LongVB, and LVBench. Moreover, it achieves significant speed-ups (e.g., up to 9× on a LLaVA-Video-7B model), highlighting FlexSelect's promise for efficient long-form video understanding. |
| Researcher Affiliation | Collaboration | Yunzhu Zhang¹, Yu Lu¹, Tianyi Wang³, Fengyun Rao³, Yi Yang¹,², Linchao Zhu¹,². ¹The College of Computer Science and Technology, Zhejiang University; ²The State Key Lab of Brain-Machine Intelligence, Zhejiang University; ³WeChat Vision, Tencent Inc. |
| Pseudocode | No | The paper describes the method using textual explanations and diagrams (Figure 3 and Figure 4) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: https://yunzhuzhang0918.github.io/flex_select. [...] We have released our code and trained model weights. |
| Open Datasets | Yes | We evaluate it on four challenging long-video understanding benchmarks: VideoMME, MLVU, LongVB, and LVBench [...] (1) LongVideoBench [41], (2) MLVU [54], (3) VideoMME [12], and (4) LVBench [38]. |
| Dataset Splits | Yes | We randomly select a small (~5%) subset of LLaVA-Video-178K [53] as the training data, which contains about 67k video instruction samples. [...] We randomly sample 128 videos from the VideoMME [12] test set and insert each needle-query pair into them, resulting in a total of 640 test samples. |
| Hardware Specification | Yes | The evaluations in main Table 1 are conducted under the LMMS-Eval [48] framework on 8×96G H20 GPUs. |
| Software Dependencies | No | The paper mentions using models like LLaVA-Video, InternVL, Qwen-VL, and the LMMS-Eval framework, but does not provide specific version numbers for programming languages or libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We set the input frames number N to 1024, 512, 512 for Qwen2.5VL (7B/72B), LLaVA-Video (7B/72B), and InternVL2.5-8B respectively, and max subset frames number S to 64 for all models. [...] Token selectors for LLaVA-Video and InternVL2.5 were trained for 1 epoch, while the token selector for Qwen2.5VL was trained for 3 epochs. |