Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Authors: Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, Qin Jin
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments We first present our experimental setup in Section 4.1. Then, we evaluate our model from three key perspectives: (1) Comparison with state-of-the-art methods in Section 4.2: We evaluate our model across multiple TVG benchmarks to assess its performance against existing approaches; (2) Ablation studies and analyses in Section 4.3: We examine the individual contributions of each component in our framework to better understand their roles in overall performance. |
| Researcher Affiliation | Collaboration | ¹AIM3 Lab, Renmin University of China; ²MiLM Plus, Xiaomi Inc. |
| Pseudocode | No | The paper describes the Time-R1 framework and training procedures in Sections 3.2 and 3.3, but does not present them in a structured pseudocode or algorithm block. |
| Open Source Code | No | Code and data will be released to reproducing or verifying the results. We provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results. |
| Open Datasets | Yes | Our training videos are sourced from Internet video datasets including YT-Temporal [59], DiDeMo [3], QuerYD [40], InternVid [52], and HowTo100M [38]. We obtain grounding data with annotations from VTG-IT [17], TimeIT [47], TimePro [65], HTStep [2], and LongVid [28]. This process yields 339K temporal grounding samples. To ensure a comprehensive evaluation, we construct our TVGBench with curating samples from five public benchmarks with a balanced distribution of data source: Charades-STA [49], ActivityNet-Captions [5], HiREST [64], EgoNLQ [16], and TaCoS [46]. |
| Dataset Splits | Yes | Benchmarks. We evaluate our model on a wide range of benchmarks covering both temporal video grounding and general video understanding tasks, including: (1) Charades-STA [49] contains 6,672 long videos capturing indoor human activities. The official split for the TVG task includes 12,408 clip-query pairs for training and 3,720 for testing. (2) ActivityNet [5] comprises 20K long videos with an average of 3.65 clip-query pairs per video. Following previous work in fine-tuning setting [67, 23] for the TVG task, we use the standard dataset splits with 37,421 training, 17,505 validation, and 17,031 test samples. |
| Hardware Specification | No | Table 6: Inference speed comparison between Hugging Face Transformers and vLLM libraries. Speeds are reported as (with CoT / without CoT) with 8 GPUs. ...vLLM delivers substantial performance gains across all datasets. For example, on the TVGBench benchmark, inference time with CoT is reduced from 42 minutes to just 8.3 minutes, achieving over a 5× speedup. |
| Software Dependencies | No | Unless otherwise specified, we use Qwen2.5-VL-7B [4] as the base model. Inspired by DAPO [61], we adopt its token-level loss for training, rather than the sample-level loss used in GRPO. We full-finetune the LLM parameters at every step, thus $\frac{\pi_\theta(o_i)}{\pi_{\theta_{\text{old}}}(o_i)} = 1$. The sample number G is set to 8. The coefficient β is set to 0.04. We use a learning rate of 1e-6 with the AdamW optimizer with β1 = 0.9, β2 = 0.999, and a linear scheduler to decay the learning rate from 1e-6 to 0. We implemented an accelerated inference version using vLLM [24] for all related 7 downstream benchmarks. We implemented two versions of SFT fine-tuning: one is full-parameter fine-tuning of the LLM (SFT), and the other is LoRA-based fine-tuning of the LLM (SFT-LoRA). For SFT-LoRA, the LoRA rank is set to 64, and the LoRA alpha is set to 128. Both configurations use the following settings: a learning rate of 2e-5, the AdamW optimizer with β1 = 0.9, β2 = 0.999, a weight decay of 0, the batch size of 8, and accumulation steps of 2. |
| Experiment Setup | Yes | Implementation details. Unless otherwise specified, we use Qwen2.5-VL-7B [4] as the base model. To strike a balance between training efficiency and memory consumption, we sample video frames at 2 FPS and adaptively resize each video input to contain approximately 2.8 million pixels. For instance, a 50-second video yields 100 frames, each with a resolution of roughly 96 × 96 × 3. During the reinforcement fine-tuning phase, we train for 5 epochs using a batch size of 8 and select the final checkpoint for evaluation. For fine-tuning on downstream benchmarks, we train for 2 epochs. More implementation details are provided in Appendix B. Details of Time-R1 framework. Inspired by DAPO [61], we adopt its token-level loss for training, rather than the sample-level loss used in GRPO. Apart from minor changes to the loss, all setting is identical to GRPO. Besides, we find that other techniques introduced in DAPO do not benefit the TVG task, thus aborting other techniques. We full-finetune the LLM parameters at every step, thus $\frac{\pi_\theta(o_i)}{\pi_{\theta_{\text{old}}}(o_i)} = 1$. The sample number G is set to 8. The coefficient β is set to 0.04. Details of TimeRFT training. For RFT data filtering, we use a Gaussian distribution with a fixed variance of 0.2, while varying the mean to control sample selection. In our cold start phase, we construct 150 samples from our training data sources (e.g., YT-Temporal [59]) to fine-tune the LLM using LoRA [21], with a LoRA rank of 64 and a LoRA alpha of 128. All of our results are reported based on the final training epoch. For RL, we use a learning rate of 1e-6 with the AdamW optimizer with β1 = 0.9, β2 = 0.999, and a linear scheduler to decay the learning rate from 1e-6 to 0. We use a batch size of 8 with gradient accumulation set to 2. |
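
The Experiment Setup excerpt quotes two pieces of arithmetic that can be checked directly: the per-frame resolution implied by the 2.8-million-pixel budget at 2 FPS, and the linear decay of the learning rate from 1e-6 to 0. The following is a minimal Python sketch of both, not the authors' code. The function names `per_frame_side` and `linear_decay_lr` are hypothetical, and the sketch assumes the budget counts all three channel values and that frames are square, which is the reading that matches the quoted "roughly 96 × 96 × 3" example.

```python
import math


def per_frame_side(duration_s, fps=2.0, value_budget=2_800_000, channels=3):
    """Return (num_frames, approximate square side length per frame).

    Assumption: the ~2.8M budget is shared evenly across frames and counts
    every channel value, so each frame gets value_budget / num_frames values.
    """
    num_frames = int(duration_s * fps)
    values_per_frame = value_budget / num_frames
    side = math.sqrt(values_per_frame / channels)
    return num_frames, side


def linear_decay_lr(step, total_steps, base_lr=1e-6):
    """Learning rate decayed linearly from base_lr at step 0 to 0 at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * (1.0 - frac)


if __name__ == "__main__":
    frames, side = per_frame_side(50)
    print(frames, round(side, 1))    # 100 frames, side close to 96
    print(linear_decay_lr(50, 100))  # halfway through training
```

Under these assumptions, a 50-second clip gives 100 frames with a side length of about 96.6 pixels, consistent with the quoted figure. The total step count passed to `linear_decay_lr` is a placeholder, since the excerpt does not state the number of optimizer steps.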