Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Scaling RL to Long Videos
Authors: Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Notably, our MR-SP system achieves up to 2.1× speedup on long video RL training. |
| Researcher Affiliation | Collaboration | Yukang Chen¹, Wei Huang¹,³, Baifeng Shi¹,⁴, Qinghao Hu², Hanrong Ye¹, Ligeng Zhu¹, Zhijian Liu¹, Pavlo Molchanov¹, Jan Kautz¹, Xiaojuan Qi³, Sifei Liu¹, Hongxu Yin¹, Yao Lu¹, Song Han¹,². ¹NVIDIA, ²MIT, ³HKU, ⁴UC Berkeley |
| Pseudocode | No | The paper describes methods using mathematical formulations (Equations 1 and 2) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models are available at https://github.com/NVlabs/Long-RL and https://huggingface.co/Efficient-Large-Model/LongVILA-R1-7B. |
| Open Datasets | Yes | We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs... Additionally, an extra 102K samples from other datasets [53, 46, 31, 18, 44] are incorporated to scale up the RL. Code and models are available at https://github.com/NVlabs/Long-RL |
| Dataset Splits | Yes | We use 36K high quality samples for Long-CoT-SFT to initialize the model's reasoning and instruction-following abilities, and 68K samples with an additional 102K video data [53, 46, 31, 18, 44] for reinforcement learning. |
| Hardware Specification | Yes | On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames). This entire data procedure consumes about 80,000 H100 GPU hours. We conduct the training efficiency comparison for our MR-SP system, one A100 node, i.e., 8x A100 (80GB) GPUs. |
| Software Dependencies | No | It incorporates a vLLM engine [19] tailored for LongVILA and a caching scheme for video embeddings. The MR-SP system alleviates the problem of intensive memory and facilitates RL training of long video VLMs... leveraging Ray [28] for efficient data flow and vLLM [19] for faster sampling. (No specific version numbers are provided for vLLM or Ray.) |
| Experiment Setup | Yes | G is set as 8 in our experiments, and the sampled rewards above are normalized to get the advantages (Ai) for updating the model. We use LongVILA-7B-R1 model with training batch size as 1 per GPU and rollout number as 5. |
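
The Experiment Setup entry mentions that a group of G = 8 sampled rewards is normalized to obtain the advantages. The exact formula is given in the paper's equations and is not reproduced on this page, so the following is only a minimal sketch, assuming the common group-relative form (subtract the group mean, divide by the group standard deviation). The function name, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are illustrative assumptions, not taken from the paper or the Long-RL code.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Return one advantage per sampled response in a group.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), using the
    population standard deviation. `eps` is a hypothetical stabilizer that
    avoids division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for one prompt with G = 8 sampled responses
# (1.0 = correct answer, 0.0 = incorrect answer).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
advantages = group_normalized_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.3f}")
```

Under this assumption, responses scoring above the group mean receive positive advantages and those below receive negative ones; a group in which all rewards are equal yields all-zero advantages and therefore contributes no learning signal.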