Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning
Authors: Qi (Cheems) Wang, Yanrui Yu, Ye Yuan, Rui Mao, Tianfei Zhou
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that VIDEORFT achieves state-of-the-art performance on six video reasoning benchmarks. The paper includes a dedicated section '4 Experiment' with subsections '4.1 Experimental Setup', '4.2 Main Result', and '4.3 Diagnostic Experiment', detailing empirical evaluation and performance metrics. |
| Researcher Affiliation | Academia | The authors are affiliated with 'Beijing Institute of Technology' and 'Shenzhen University', which are both academic institutions. No corporate affiliations or company email domains are listed. |
| Pseudocode | No | The paper describes its methodology using descriptive text and illustrative figures (e.g., Figure 1, Figure 3, Figure 5) but does not include any explicitly labeled pseudocode blocks or algorithms with structured, code-like steps. |
| Open Source Code | Yes | The paper provides a GitHub link 'https://github.com/QiWang98/VideoRFT' on its first page. Additionally, the NeurIPS checklist states: 'Justification: The constructed datasets and training code with documentation will be opensourced upon paper acceptance'. |
| Open Datasets | Yes | The paper explicitly states: 'This pipeline results in two new datasets, i.e., VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL.' and 'Justification: The constructed datasets and training code with documentation will be opensourced upon paper acceptance'. It also lists various established open datasets used for data collection in Figure 2, such as 'LLaVA-Video-178k' and 'A-OKVQA', which are common benchmarks. |
| Dataset Splits | No | The paper mentions training on 'VideoRFT-CoT-102K' and 'VideoRFT-RL-310K' and evaluating on six video reasoning benchmarks, stating 'Following previous works [9, 18, 51]'. However, it does not explicitly provide the training, validation, and test splits (e.g., percentages or sample counts) for its newly constructed datasets or the specific splits used for the benchmarks. |
| Hardware Specification | Yes | The paper states under 'Implementation Details': 'We use Qwen2.5-VL-7B [1] as the base model and train VIDEORFT on 8 NVIDIA A800 GPUs, with 80GB each.' |
| Software Dependencies | No | The paper mentions: 'The RL training is implemented using the Hugging Face TRL library [39], and our codebase is built upon Open-R1 [7].' and 'For efficiency, we use a lightweight version of SigLIP'. However, it does not provide specific version numbers for these libraries (TRL, Open-R1, SigLIP). |
| Experiment Setup | Yes | Under 'Implementation Details', the paper specifies: 'the video input is limited to 16 frames, with each frame processed into 128 × 28 × 28 resolution during training... During inference, we increase the number of frames to 32 and the resolution to 256 × 28 × 28... The entire model is trained for one epoch of SFT followed by 1K steps of RL.' |
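
The frame and resolution figures quoted in the Experiment Setup row can be turned into per-frame and per-video pixel counts. The sketch below is illustrative only: it assumes each quoted resolution is a product of a patch count and a 28 by 28 pixel patch, and the names `VideoInputBudget`, `TRAIN`, and `INFER` are hypothetical and do not come from the paper or its codebase.

```python
from dataclasses import dataclass

# Assumption: the trailing "28 28" in the quoted resolution is a 28 x 28 pixel patch.
PATCH_SIDE = 28


@dataclass(frozen=True)
class VideoInputBudget:
    """Hypothetical container for the frame count and per-frame resolution."""

    num_frames: int
    patches_per_frame: int  # leading factor of the quoted resolution (128 or 256)

    @property
    def pixels_per_frame(self) -> int:
        # Patch count times the area of one 28 x 28 patch.
        return self.patches_per_frame * PATCH_SIDE * PATCH_SIDE

    @property
    def total_pixels(self) -> int:
        # Pixels across all sampled frames of one video.
        return self.num_frames * self.pixels_per_frame


# Values taken from the quoted 'Implementation Details' text.
TRAIN = VideoInputBudget(num_frames=16, patches_per_frame=128)
INFER = VideoInputBudget(num_frames=32, patches_per_frame=256)

if __name__ == "__main__":
    for name, budget in (("train", TRAIN), ("inference", INFER)):
        print(f"{name}: {budget.pixels_per_frame} px/frame, "
              f"{budget.total_pixels} px/video")
```

Under this reading, inference uses twice as many frames and twice as many pixels per frame as training, so four times the visual input per video.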