Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

Authors: Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, Michael Qizhe Shieh

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on 2 distinct training datasets demonstrate that NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models across 5 out-of-domain reasoning and perception benchmarks. Furthermore, we validate the effectiveness of NoisyRollout across model sizes (7B and 32B), data scales (from 1K to 6K), and image augmentation types (Gaussian noise and rotation), highlighting its generalizability and scalability.
Researcher Affiliation	Collaboration	1National University of Singapore 2Sea AI Lab
Pseudocode	Yes	A simplified overview is provided in Figure 1 and Algorithm 1.
Open Source Code	Yes	The code and running instructions are provided in the supplementary materials.
Open Datasets	Yes	Our experiments utilize two datasets: Geometry3K [47], focused on geometric problem solving, and MMK12 [53], covering diverse K-12 math topics. These datasets comprise 2.1K and 6.4K training samples respectively.
Dataset Splits	No	Trained with only 2.1K samples from the Geometry3K [47] dataset... Second, we evaluate the in-domain performance of NoisyRollout by comparing it with the vanilla GRPO baseline on the Geometry3K test set.
Hardware Specification	Yes	All experiments are conducted using 8 A100 GPUs (40G for 7B model, 80G for 32B model).
Software Dependencies	No	We use EasyR1 [96] as our reinforcement learning training framework, which is built on verl [65] and specifically designed for VLMs. Our experiments utilize two datasets: Geometry3K [47]... We initialize our policy models with Qwen2.5-VL-7/32B-Instruct...
Experiment Setup	Yes	For other general RL-related hyperparameters, we adopt the default settings from EasyR1: a global batch size of 128, a rollout batch size of 512, a rollout temperature of 1.0, and a learning rate of 1e-6. For NoisyRollout-specific configurations, we adopt Gaussian noise as the default image distortion strategy, and apply a sigmoid-shaped annealing schedule: $\alpha_t = \eta(\alpha_0, t, t_{\max}) = \alpha_0 \cdot \left(1 - \frac{1}{1 + e^{\lambda (t - \gamma)/t_{\max}}}\right)$, where γ determines the midpoint of the annealing curve and λ controls its steepness. Additional implementation details regarding the number of training steps/epochs and the hyperparameters for image distortion and noise annealing are presented in Appendix J.
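The annealing schedule quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration that transcribes the expression exactly as it appears on this page, not the authors' implementation. The function name `noise_strength` and every numeric value in the demo are assumptions made for illustration; the values actually used are reported in Appendix J of the paper and are not reproduced here.

```python
import math


def noise_strength(alpha0: float, t: float, t_max: float, lam: float, gamma: float) -> float:
    """Evaluate alpha_t = alpha0 * (1 - 1 / (1 + exp(lam * (t - gamma) / t_max))).

    Hypothetical helper: the name and signature are not taken from the paper or from EasyR1.
    """
    z = lam * (t - gamma) / t_max
    if z > 700.0:
        # math.exp would overflow; the expression tends to alpha0 in this limit.
        return alpha0
    return alpha0 * (1.0 - 1.0 / (1.0 + math.exp(z)))


if __name__ == "__main__":
    # Illustrative values only (assumptions, not the paper's settings).
    alpha0, t_max, lam, gamma = 0.5, 100, -10.0, 50
    for step in (0, 25, 50, 75, 100):
        print(step, noise_strength(alpha0, step, t_max, lam, gamma))
```

Two properties follow directly from the expression as transcribed: at t = γ the value is exactly α0 / 2, which matches the description of γ as the midpoint, and the direction of the curve depends on the sign of λ, which this page does not state. The demo uses a negative λ so that the noise strength decays over training, which is one reading of "annealing"; treat that choice as an assumption.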