Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

Authors: Ziang Yan, Yinan He, Xinhao Li, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, Yi Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate VTTS's effectiveness and generalization across diverse tasks and benchmarks. Our newly introduced VideoChat-R1.5 model has achieved remarkable improvements, with an average increase of over 5%, compared to robust baselines such as Qwen2.5-VL-3B and -7B, across more than 15 benchmarks that encompass video conversation, video reasoning, and spatio-temporal perception.
Researcher Affiliation | Academia | 1 Zhejiang University; 2 Shanghai AI Laboratory; 3 Nanjing University; 4 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Pseudocode | No | The paper describes the methodology and inference process in text (e.g., in Section 3 "Methodology" and Section 3.1 "Learning Iterative Perception with Reinforcement Fine-Tuning") but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | Answer: [No] Justification: We are currently organizing the code and data, and we promise to release all code and data on GitHub in the future.
Open Datasets | Yes | To comprehensively evaluate the general capabilities of our models, we conduct experiments across a diverse suite of benchmarks. For video perception, we report results on MVBench [32] and Perception Test [44], which assesses fine-grained temporal understanding, including action types, sequences, and movement directions. General video understanding is assessed using VideoMME [16], consisting of videos of short, medium, and long durations. In the long-video domain, we evaluate performance on MLVU [81], LVBench [52], and LongVideoBench [61]. Video-based knowledge modeling is quantified using the VideoMMMU [23] benchmark. To evaluate the model's visual-spatial intelligence, we adopt the VSIBench [69]. Furthermore, we specifically assess spatio-temporal grounding capabilities using a range of detection and temporal grounding datasets. Grounded video QA task requires the model to not only provide accurate answers regarding videos, but also identify the specific temporal segments that support those answers. This task highlights the need for joint reasoning between semantic understanding and temporal context, leading to accurate and interpretable predictions. We evaluate our model on two grounded video QA benchmarks: NextGQA [63] and ReXTime [7].
Dataset Splits | No | The paper lists numerous datasets used (e.g., MVBench, Perception Test, VideoMME, NextGQA, Charades-STA, RefCOCO, etc.) and describes how some data is processed (e.g., uniform frame sampling, dense/sparse sampling), but it does not specify explicit training, validation, or test splits (percentages, counts, or references to standard splits used for their experiments) for any of these datasets.
Hardware Specification | No | The paper mentions input data constraints like frame ranges and video/image resolutions in Section 8 "Training Details", but it does not specify any hardware details (e.g., CPU, GPU models, memory, cloud platforms) used for training or inference.
Software Dependencies | No | The paper mentions using an "AdamW optimizer" in Section 8 "Training Details" but does not specify any software libraries or frameworks with their version numbers (e.g., PyTorch version, Python version, CUDA version).
Experiment Setup | Yes | Implementation. We apply VTTS to the latest Qwen series like Qwen2.5-VL-7B and Qwen2.5-VL-3B using VTTS-80K dataset with reinforcement fine-tuning (RFT). Training is performed with a learning rate of 2e-6 and a batch size of 16. The reward function comprises three components: format reward, clue reward (quantified by the IoU of visual clues), and answer reward. At inference time, the default number of iterative perception (ITP) iterations is set to 3. (...) Training Details. The VTTS RL training is configured with the following parameters. We use an AdamW optimizer with a learning rate of 2 * 10^-6, zero weight decay, and a linear learning rate schedule without warmup. The total batch size is set to 16.
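
The Experiment Setup entry names a three-part reward (format, clue IoU, answer) and a linear learning rate schedule without warmup, but the paper's code is not released. The following is a minimal Python sketch of how such a setup might look, not the authors' implementation. Several details are assumptions made only for illustration: clues are treated as temporal intervals in seconds, the `<clue>`/`<answer>` output template is hypothetical, the three rewards are summed without weights, the answer is scored by case-insensitive exact match, and the schedule decays linearly to zero.

```python
import re


def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds.

    Assumption: clues are temporal segments; the paper only says the clue
    reward is "quantified by the IoU of visual clues".
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical output template; the tag names are not taken from the paper.
FORMAT_RE = re.compile(r"^<clue>.*?</clue>\s*<answer>.*?</answer>$", re.DOTALL)


def format_reward(text):
    """1.0 if the response follows the (assumed) template, else 0.0."""
    return 1.0 if FORMAT_RE.match(text.strip()) else 0.0


def answer_reward(pred, gt):
    """1.0 on case-insensitive exact match (an assumed scoring rule)."""
    return 1.0 if pred.strip().lower() == gt.strip().lower() else 0.0


def total_reward(text, pred_clue, gt_clue, pred_answer, gt_answer):
    """Unweighted sum of the three components (the weighting is assumed)."""
    return (
        format_reward(text)
        + temporal_iou(pred_clue, gt_clue)
        + answer_reward(pred_answer, gt_answer)
    )


def linear_lr(step, total_steps, base_lr=2e-6):
    """Linear decay from base_lr with no warmup.

    Assumption: the rate reaches zero at total_steps; the paper states only
    "a linear learning rate schedule without warmup".
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * (1.0 - frac)
```

For example, `temporal_iou((0, 10), (5, 15))` overlaps for 5 seconds out of a 15-second union, so the clue reward would be one third under these assumptions.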