Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

EvolvedGRPO: Unlocking Reasoning in LVLMs via Progressive Instruction Evolution

Authors: Zhebei Shen, Qifan Yu, Juncheng Li, Wei Ji, Qizhi Chen, Siliang Tang, Yueting Zhuang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 EXPERIMENTS. We evaluate model performance along two dimensions. First, we evaluate out-of-domain generalization across three visual reasoning benchmarks: MathVerse [59], MathVision [60], MathVista [61], and two textual reasoning benchmarks: GSM8K [62] and MATH500 [63]. Second, we evaluate the performance of EvolvedGRPO across general benchmarks, including MMMU [30], MMStar [31], and AI2D [64].
Researcher Affiliation | Academia | Zhebei Shen (1), Qifan Yu (1), Juncheng Li (1), Wei Ji (2), Qizhi Chen (3), Siliang Tang (1), Yueting Zhuang (1); (1) Zhejiang University, (2) Nanjing University, (3) Peking University; EMAIL
Pseudocode | Yes | Algorithm 1: EvolvedGRPO Training Procedure
Open Source Code | Yes | The Code for EvolvedGRPO is available at https://github.com/SHENZHEBEI/EvolvedGRPO.
Open Datasets | Yes | MMK12 [4] dataset, a high-quality and diverse multi-modal mathematical reasoning dataset, being composed of MAVIS [55], Geo3k [56], RCOT [57], MultiMath [58] datasets. For the general experimental setup, we adopt Qwen2.5-VL-7B-Instruct as the base model and train it using the GRPO strategy. More details of the experimental setup are shown in Appendix B. Benchmarks. We evaluate model performance along two dimensions. First, we evaluate out-of-domain generalization across three visual reasoning benchmarks: MathVerse [59], MathVision [60], MathVista [61], and two textual reasoning benchmarks: GSM8K [62] and MATH500 [63]. Second, we evaluate the performance of EvolvedGRPO across general benchmarks, including MMMU [30], MMStar [31], and AI2D [64].
Dataset Splits | No | The paper mentions using the MMK12 [4] dataset for training and constructs a validation set of 3,000 instances from multiple benchmarks. However, it does not explicitly provide the specific training/test/validation splits (e.g., percentages or exact counts) for the primary MMK12 training dataset used in their experiments.
Hardware Specification | Yes | Table 6: Resource cost statistics for each stage of the training pipeline using 4 RTX A6000. Table 4: Training hyper-parameters of EvolvedGRPO. ... Resource Usage: 4 RTX A6000, 4 RTX A6000
Software Dependencies | No | The paper mentions 'Qwen2.5-VL-7B-Instruct' as the base model, 'vLLM [65]' for accelerated inference, 'GPT-4o [1]' as the evaluation judge, and 'rule-based mathruler [53]' for reward calculation. However, it does not provide specific version numbers for these software components or any other libraries and frameworks used.
Experiment Setup | Yes | We train our models using GRPO strategy. The both models are initialized with Qwen2.5-VL-7B-Instruct [23]. The detailed hyper-parameters used during training are summarized in Table 4. In the training process, the details of two different types of instructions are presented in Table 5. Table 4: Training hyper-parameters of EvolvedGRPO. (This table lists LLM Init, KL Penalty, KL Coefficient, Optimizer, Learning Rate, Weight Decay, Numerical Precision, Gradient Clipping, Rollout n, Rollout Temperature, Rollout Top-p, Rollout Batch Size, Micro Batch Size for Update, Micro Batch Size for Experience, Training Steps, Total Epochs).
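
The Software Dependencies entry above notes that the tools are named but no versions are given. As an illustration only, the sketch below shows one way a reader reproducing the work could record the versions installed in their own environment. It is not taken from the paper or its repository. The distribution names "vllm" and "mathruler" are assumptions about how the mentioned tools are packaged, and the helper names collect_versions and format_pins are hypothetical.

```python
from importlib import metadata

# Assumed distribution names for the tools named in the report; adjust as needed.
PACKAGES = ["vllm", "mathruler"]


def collect_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


def format_pins(versions):
    """Render installed packages as sorted 'name==version' lines, skipping missing ones."""
    return [
        f"{name}=={version}"
        for name, version in sorted(versions.items())
        if version is not None
    ]


if __name__ == "__main__":
    for line in format_pins(collect_versions(PACKAGES)):
        print(line)
```

Saving the printed lines next to the experiment logs gives a pinned list that can later be reused as a requirements file. Packages that are not installed are skipped rather than guessed.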