Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Authors: Longtian Qiu, Shan Ning, Jiaxuan Sun, Xuming He

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO.
Researcher Affiliation | Academia | (1) ShanghaiTech University, Shanghai, China; (2) Shanghai Engineering Research Center of Intelligent Vision and Imaging; (3) Lingang Laboratory, Shanghai, China
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO.
Open Datasets | Yes | To enhance the general Chain-of-Thought (CoT) reasoning ability of MLLMs through reinforcement learning, we adopt the visual question answering (VQA) portion of the MM-RLHF [62] training set as our training data. This dataset spans diverse domains, including conversations, safety, multiple-choice questions, captions, and commonsense reasoning. MM-RLHF applies clustering and filtering techniques to curate a high-quality visual instruction-following dataset. In total, it contains 13k VQA samples: 1.2k yes/no questions, 1.3k multiple-choice questions, and 10k open-ended VQA questions. Notably, most of the training data lack structured answers; therefore, we use a text embedding model to compute the accuracy reward. While this may introduce some model bias, our Bayesian advantage estimation (Section 3.3) helps mitigate it.
Dataset Splits | No | The paper mentions using a 'training set' of 13k samples (MM-RLHF) and evaluating on 'evaluation benchmarks', but it does not specify how the 13k training data is split into training, validation, or test sets for its own experiments. It describes the composition of the training set but not its partitioning. For ablation, it mentions random sampling of subsets (3k and 6k) from the full training corpus but not specific train/val/test splits.
Hardware Specification | Yes | The training takes 6 hours for the 3B variant and 7 hours for the 7B variant on a single node with 8 A100 GPUs.
Software Dependencies | No | The paper mentions implementing NoisyGRPO based on the VLM-R1 reinforcement learning training framework, using Qwen2.5-VL as the policy model, and utilizing the SBERT model for embeddings. However, it does not provide specific version numbers for these frameworks or models, nor for other underlying software dependencies like Python or PyTorch.
Experiment Setup | Yes | For hyperparameters, we follow the default settings provided by VLM-R1 except for the following changes. The number of sampled rollouts G is set to 4 due to computational resource considerations. In noise injection, the upper bound σ of U(0, σ) is set to 1. The α and γ are set to 0.1 and 0.01. The threshold τ for the accuracy reward is set to 0.6. As there is no validation set, we choose all the hyperparameters based on the results on MMStar [5].
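
The reported setup values can be collected into a small configuration sketch. The Python below is illustrative only and is not the authors' code: the class and function names are hypothetical, the roles of α and γ are not described in this summary so they are stored without interpretation, and the binary thresholding of the embedding similarity at τ is an assumption about how the accuracy reward might be computed. In the paper the embeddings come from an SBERT model; plain lists of floats stand in for them here.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NoisyGRPOSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    num_rollouts: int = 4            # G, sampled rollouts per prompt
    noise_upper_bound: float = 1.0   # sigma, upper bound of U(0, sigma)
    alpha: float = 0.1               # alpha as reported; role not stated in this summary
    gamma: float = 0.01              # gamma as reported; role not stated in this summary
    accuracy_threshold: float = 0.6  # tau for the accuracy reward


def sample_noise_levels(setup, rng):
    """Draw one noise level per rollout from U(0, sigma)."""
    return [rng.uniform(0.0, setup.noise_upper_bound)
            for _ in range(setup.num_rollouts)]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def accuracy_reward(pred_embedding, ref_embedding, setup):
    """ASSUMPTION: reward is 1.0 when similarity reaches tau, else 0.0."""
    similarity = cosine_similarity(pred_embedding, ref_embedding)
    return 1.0 if similarity >= setup.accuracy_threshold else 0.0


setup = NoisyGRPOSetup()
noise_levels = sample_noise_levels(setup, random.Random(0))
```

Keeping the threshold in the configuration object rather than hard-coding it makes it easier to vary τ alongside the other values, which the setup row says were all chosen from MMStar results.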