Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Video-R1: Reinforcing Video Reasoning in MLLMs

Authors: Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, Xiangyu Yue

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc.
Researcher Affiliation | Academia | Kaituo Feng¹, Kaixiong Gong¹, Bohao Li², Zonghao Guo³, Yibing Wang⁴, Tianshuo Peng¹, Junfei Wu⁴, Xiaoying Zhang⁵, Benyou Wang², Xiangyu Yue¹; ¹CUHK MMLab, ²CUHK (SZ), ³Tsinghua University, ⁴UCAS, ⁵CUHK HCCL
Pseudocode | No | The core idea behind T-GRPO is to compare the model's performance on the same video question when frames are provided in two different orders: (1) the temporally ordered sequence, and (2) a randomly shuffled version. For each input question, we generate two groups of responses $\{o_i\}_{i=1}^{G}$ and $\{\tilde{o}_i\}_{i=1}^{\tilde{G}}$ using the ordered and shuffled frame inputs, respectively. Let $p$ and $\tilde{p}$ denote the proportion of correct answers in each group. We then define a temporal reward $r_t$ as: $r_t = \begin{cases} \alpha, & \text{if } p \geq \tilde{p} \\ 0, & \text{otherwise} \end{cases}$ (1) where $\alpha$ is a hyperparameter controlling the magnitude of the temporal reward. Here we set $\alpha = 0.3$. This contrastive design encourages the model to perform better when the video is presented in correct temporal order than when it is shuffled. The model is only granted this positive reward if its current reasoning strategy for a given question demonstrates a reliance on temporal information. For tasks with continuous rewards (e.g., free-form answers), a threshold (e.g., 0.5) can be used to determine whether a response is considered correct. Importantly, $r_t$ is only applied to correct responses to ensure meaningful positive advantages. Applying it to all responses would dilute the reward signal and hinder effective learning. In other words, when the model's reasoning policy successfully relies on temporal patterns, correct responses are reinforced with a higher reward, while incorrect ones remain unaffected. Formally, the temporal-augmented reward is defined as: $R_i = \begin{cases} r_i + r_t, & \text{if } o_i \text{ is correct} \\ r_i, & \text{otherwise} \end{cases}$ (2) where $r_i$ is the reward for response $i$, containing both the correctness reward and the format reward, following [11]. $R_i$ is the final reward used for calculating advantages. This reward shaping ensures that when the model answers correctly under a temporal setting but fails to outperform the shuffled baseline, it receives no additional reward, pushing the optimization toward adopting a more temporally aware reasoning policy.
The temporal reward $r_t$ could also be added to the advantages directly. Then, the advantage $A_i$ is computed over the rewards within each group: $A_i = \frac{R_i - \text{mean}(\{R_j\})}{\text{std}(\{R_j\})}$ (3) Following DeepSeek R1 [11], the final policy update is as follows: $J_{\text{T-GRPO}}(\theta) = \mathbb{E}_{q,\{o_i\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\ \text{clip}\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\ 1 - \epsilon,\ 1 + \epsilon \right) A_i \right) - \beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) \right) \right]$ (4) By explicitly comparing the model's performance under ordered and shuffled inputs, T-GRPO introduces a contrastive training signal that drives the model to prefer reasoning strategies that leverage temporal patterns. It is worth noting that T-GRPO is only employed for video-based inputs in the training process of Video-R1.
Open Source Code | Yes | All code, models, and data are released in https://github.com/tulerfeng/Video-R1.
Open Datasets | Yes | We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. All code, models, and data are released in https://github.com/tulerfeng/Video-R1.
Dataset Splits | No | We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data.
Hardware Specification | Yes | We train our model using up to 8 NVIDIA A100 (80GB) GPUs.
Software Dependencies | No | We adopt Qwen2.5-VL-7B-Instruct [1] as the base MLLMs for training. We use the Adam optimizer with a learning rate of 1e-6 to train our model.
Experiment Setup | Yes | For efficiency considerations, we limit the maximum number of video frames to 16 during training. Each frame is processed at a max resolution of $128 \times 28 \times 28$ pixels. During inference, we increase the frame resolution to $256 \times 28 \times 28$ pixels and frames to 16 to 64 to enhance performance. The ordered group size $G$ is set to 8 and the shuffled group size $\tilde{G}$ is set to half of that for efficiency. More details are provided in Appendix D. (From Appendix D): We use the Adam optimizer with a learning rate of 1e-6 to train our model. The SFT stage takes approximately 40 hours per epoch, while the RL stage takes around 15 hours for 1k steps. The hyperparameter β in the KL divergence term of the GRPO algorithm is set to 0.04. To ensure training stability, we apply a weight decay rate of 0.01 and clip the maximum gradient norm to 5. The maximum response length is set to 768 tokens.
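
As a rough illustration of the T-GRPO reward shaping quoted in the Pseudocode row (Equations 1 to 3), the following Python sketch computes the temporal reward, the shaped rewards, and the group-normalized advantages for a single question. It is not the authors' released implementation. Only alpha = 0.3 and the group sizes (8 ordered, 4 shuffled responses) come from the quoted text; the function names, the example correctness flags, the 1.0/0.0 base rewards, the non-strict comparison between the two accuracies, the use of the population standard deviation, and the small eps guard are assumptions made for illustration.

```python
from statistics import pstdev

ALPHA = 0.3  # temporal reward magnitude, as stated in the quoted text


def temporal_reward(ordered_correct, shuffled_correct, alpha=ALPHA):
    """Eq. (1): return alpha if the ordered group is at least as accurate
    as the shuffled group, else 0. The non-strict comparison is an assumption."""
    p = sum(ordered_correct) / len(ordered_correct)
    p_tilde = sum(shuffled_correct) / len(shuffled_correct)
    return alpha if p >= p_tilde else 0.0


def shaped_rewards(base_rewards, ordered_correct, r_t):
    """Eq. (2): add r_t only to responses that are correct."""
    return [r + r_t if c else r for r, c in zip(base_rewards, ordered_correct)]


def group_advantages(rewards, eps=1e-8):
    """Eq. (3): normalize rewards within the group.
    Population std and the eps guard are assumptions of this sketch."""
    mu = sum(rewards) / len(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical correctness flags: G = 8 ordered, G~ = 4 shuffled responses.
    ordered = [1, 1, 0, 1, 0, 1, 1, 0]
    shuffled = [1, 0, 0, 0]
    # Hypothetical base reward: 1.0 if correct, 0.0 otherwise (format reward omitted).
    base = [float(c) for c in ordered]

    r_t = temporal_reward(ordered, shuffled)
    rewards = shaped_rewards(base, ordered, r_t)
    print("temporal reward:", r_t)
    print("shaped rewards:", rewards)
    print("advantages:", group_advantages(rewards))
```

In this sketch the shuffled group is used only to decide whether the bonus is granted; the advantages are normalized over the ordered group alone, which matches the description that the shaped reward is the one "used for calculating advantages."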