Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
DToMA: Training-free Dynamic Token MAnipulation for Long Video Understanding
Authors: Bowen Yuan, Sisi You, Bing-Kun Bao
IJCAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 6 long video understanding benchmarks show that DToMA enhances both efficiency and comprehension, outperforming state-of-the-art methods and generalizing well across 3 VideoLLM architectures and sizes. |
| Researcher Affiliation | Academia | Bowen Yuan¹, Sisi You¹³, Bing-Kun Bao¹²; ¹Nanjing University of Posts and Telecommunications, ²Pengcheng Laboratory, ³State Key Laboratory of Tibetan Intelligence. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Token Reorganization Strategy |
| Open Source Code | Yes | Code is available at https://github.com/yuanrr/DToMA. |
| Open Datasets | Yes | We conducted evaluations of our method on 6 long video understanding benchmarks, including Video-MME [Fu et al., 2024a], LongVideoBench [Wu et al., 2024], EgoSchema [Mangalam et al., 2023], MLVU [Zhou et al., 2024], NExT-QA [Xiao et al., 2021], and Perception Test [Patraucean et al., 2024]. |
| Dataset Splits | No | Following evaluation tool LMMs-Eval [Zhang et al., 2024a], we perform standardized evaluation settings and metrics, i.e., accuracy, on each benchmark. The paper refers to 'standardized evaluation settings' and external benchmarks but does not explicitly provide specific split percentages, sample counts, or detailed splitting methodology within its main text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. It mentions the models and architectures used (e.g., LLaVA-Video-7B, SigLIP, Qwen2) but not the underlying computational hardware. |
| Software Dependencies | No | The paper mentions using LLaVA-Video-7B, SigLIP, and Qwen2 as core components but does not provide specific version numbers for these or other ancillary software libraries or programming languages required for replication. |
| Experiment Setup | Yes | For DToMA, the selected layers are r1 = 3, r2 ∈ [12, 18], r3 = 21. For TKR, following the optimal design [Du et al., 2024] for SigLIP, we use 2×2 pooling for keyframes and 3×3 pooling for coarse non-keyframes. The token budget B is pre-defined according to experimental requirements, and the token compression ratio adapts automatically to B. Unless otherwise specified, we set m = S = N/4. For V-Inj, we set threshold G = 0.75 and factor α = 0.25. |
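The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustrative reconstruction, not taken from the released DToMA code: all names (`DTOMA_CONFIG`, `default_m_and_s`) are hypothetical, and only the numeric values come from the paper.

```python
# Hypothetical configuration sketch of the reported DToMA hyperparameters.
# Key names are illustrative; values are as stated in the Experiment Setup row.
DTOMA_CONFIG = {
    "r1": 3,                          # first selected layer
    "r2_range": (12, 18),             # r2 is chosen from [12, 18]
    "r3": 21,                         # last selected layer
    "keyframe_pooling": (2, 2),       # 2x2 pooling for keyframes (SigLIP, TKR)
    "non_keyframe_pooling": (3, 3),   # 3x3 pooling for coarse non-keyframes
    "G": 0.75,                        # V-Inj threshold
    "alpha": 0.25,                    # V-Inj factor
}

def default_m_and_s(n_tokens: int) -> tuple[int, int]:
    """Default setting from the paper: unless otherwise specified, m = S = N/4."""
    m = s = n_tokens // 4
    return m, s
```

The token budget B is not listed here because the paper states it is pre-defined per experiment, with the compression ratio adapting to it.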