Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models

Authors: Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results demonstrate three key advantages: (1) Performance Enhancement: achieving state-of-the-art results across multiple tasks, outperforming mainstream open-source and proprietary models; (2) Generalization Superiority: consistently maintaining robust performance in addressing domain shift in typical visual reasoning tasks, outperforming alternative paradigms; (3) Data Efficiency: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines.
Researcher Affiliation	Academia	1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University 2 Beijing Academy of Artificial Intelligence 3 Institute of Automation, Chinese Academy of Sciences 4 School of Artificial Intelligence, University of Chinese Academy of Sciences
Pseudocode	No	The paper describes the methodology with mathematical formulations and high-level steps, but does not present any formal pseudocode or algorithm blocks.
Open Source Code	Yes	The NeurIPS Paper Checklist states for 'Open access to data and code': 'Answer: [Yes] Justification: See supplementary material.'
Open Datasets	Yes	In this paper, we comprehensively evaluate the visual reasoning capabilities of our method by leveraging six existing datasets, enhanced through subtask categorization, error-prone data filtering, and dataset restructuring. Specifically, we define three task categories as follows. Visual Counting... Specifically, we filtered and corrected 35K samples from CLEVR-Math [1]... To assess generalization under domain-shift (DS), we constructed 1K new samples using 3D assets from Super-CLEVR [51]... Structure Perception... We filtered 4.5K training samples and 820 ID test samples from Geo170K [52] and Math360K [55], along with 800 samples from Geometry3K [85]... Spatial Transformation... We generated 100K samples using TRANCE [56]...
Dataset Splits	Yes	Visual Counting... filtered and corrected 35K samples from CLEVR-Math [1] for training and 1K test samples for in-domain (ID) evaluation... Structure Perception... filtered 4.5K training samples and 820 ID test samples from Geo170K [52] and Math360K [55], along with 800 samples from Geometry3K [85]... Spatial Transformation... selected 60K for training and 6K for testing...
Hardware Specification	Yes	All experiments were conducted on a cluster of servers, each equipped with 8 A800 GPUs.
Software Dependencies	No	Our implementation is built on the open-source frameworks Open-R1 [88] and vLLM [89], ensuring reproducibility and scalability. The paper mentions frameworks used but does not provide specific version numbers for any key software components.
Experiment Setup	Yes	Table 4: Detailed configuration for each training stage of Reason-RFT. The table presents the training parameters for the 2B model and 7B model across three visual reasoning tasks... includes: Per-device Batch Size, Gradient Accumulation, LR, Epoch, Optimizer, Deepspeed, Weight Decay, Warmup Ratio, LR Schedule, Max Seq. Length, Max Compl. Length, Num. of Compl., GPU Nums. And: For the Visual Counting task and Spatial Transformation task, we trained the models for 1 epoch each... For the Structure Perception task... we extended the training duration to 5 epochs... In the Reason-RFT training pipeline, all models underwent an initial CoT activation stage with 1,600 samples before proceeding to the RL phase.
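
The split sizes and epoch counts quoted in the Dataset Splits and Experiment Setup rows can be collected into one small structure for quick comparison. The following is a minimal sketch, not the authors' configuration: the names `TaskSetup`, `REPORTED_SETUP`, and `sample_passes` are hypothetical, the "K" figures are treated as rounded thousands, and the quoted excerpts do not say which training stage each epoch count applies to. Hyperparameters listed only by name in Table 4 (learning rate, batch size, and so on) are deliberately left out because their values are not quoted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSetup:
    """Figures quoted in the table rows for one task (hypothetical container)."""
    train_samples: int  # rounded, e.g. "35K" -> 35_000
    test_samples: int   # test samples as quoted
    epochs: int         # training epochs as quoted

    def sample_passes(self) -> int:
        # Rough count of training samples seen: train set size times epochs.
        return self.train_samples * self.epochs


# Values transcribed from the quoted excerpts; the keys are illustrative names.
REPORTED_SETUP = {
    "visual_counting": TaskSetup(train_samples=35_000, test_samples=1_000, epochs=1),
    "structure_perception": TaskSetup(train_samples=4_500, test_samples=820, epochs=5),
    "spatial_transformation": TaskSetup(train_samples=60_000, test_samples=6_000, epochs=1),
}

# Size of the initial CoT activation stage that precedes the RL phase, as quoted.
COT_ACTIVATION_SAMPLES = 1_600

if __name__ == "__main__":
    for task, setup in REPORTED_SETUP.items():
        print(f"{task}: {setup.train_samples} train, "
              f"{setup.test_samples} test, {setup.epochs} epoch(s), "
              f"~{setup.sample_passes()} sample passes")
```

Read this way, the 5-epoch schedule for Structure Perception still amounts to fewer sample passes than a single epoch on either of the other two tasks, which is consistent with the quoted explanation that training was extended for the smaller dataset.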