Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
Authors: Gang Li, Ming C. Lin, Tomer Galanti, Zhengzhong Tu, Tianbao Yang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for a 1.5B model. |
| Researcher Affiliation | Academia | 1 Texas A&M University EMAIL EMAIL |
| Pseudocode | Yes | Algorithm 1 Discriminative Constrained Optimization |
| Open Source Code | Yes | The code is available at: https://github.com/Optimization-AI/DisCO |
| Open Datasets | Yes | Specifically, we use the DeepScaleR-Preview-Dataset [42] for training, which includes AIME problems from 1984 to 2023, AMC problems before 2023, and questions from the Omni-MATH [21] and Still [45] datasets, totaling approximately 40.3k unique problem-answer pairs. We evaluate models on six benchmark datasets: AIME 2024, AIME 2025, MATH 500 [28, 37], AMC 2023, Minerva [34], and OlympiadBench (O-Bench) [26]. |
| Dataset Splits | No | We use the DeepScaleR-Preview-Dataset [42] for training, which includes AIME problems from 1984 to 2023, AMC problems before 2023, and questions from the Omni-MATH [21] and Still [45] datasets, totaling approximately 40.3k unique problem-answer pairs. We evaluate models on six benchmark datasets: AIME 2024, AIME 2025, MATH 500 [28, 37], AMC 2023, Minerva [34], and OlympiadBench (O-Bench) [26]. |
| Hardware Specification | Yes | For all the experiments on 1.5B models, each run consumes 4*2 40G A100 GPUs and each training step takes approximately 6 minutes. For all the experiments on 7B models, each run consumes 1*8 80G H100 GPUs and each training step takes approximately 6.5 minutes. |
| Software Dependencies | No | We tune the constant learning rate in [5e-7, 1e-6, 2e-6] with AdamW optimizer with weight decay as 0.01. |
| Experiment Setup | Yes | For all the methods, we tune the constant learning rate in [5e-7, 1e-6, 2e-6] with AdamW optimizer with weight decay as 0.01. Generally, a learning rate of 2e-6 works better for the Q1.5B model, 1e-6 for the Q7B model, and 5e-7 for the L8B model. We employ a training batch size of 128, a mini-batch size of 32, and 8 responses for each question. The temperature is set to 0.6 for both training and evaluation, following the usage recommendation from [23]. For GRPO, β is set to 0.001 as commonly used [12, 42]. For GRPO-ER, we use a coefficient of 0.001 for the entropy regularization [42]. For DAPO, we set ϵ_low to 0.2 and ϵ_high to 0.28 by following their paper. For our method, δ is set to 10⁻⁴ based on the empirical observation that the average KL divergence is around 2 × 10⁻⁵ and β is set to 10³ such that the effective weight of the KL regularization when the constraint is violated by δ is on the order of β × δ = 0.1. Since L-ratio and log-L scoring functions have different orders, we choose τ = 1 for L-ratio and τ = 10 for log-L scoring function, from {0.5, 1, 5, 10}. For fair comparisons, we do not implement Dynamic Sampling [79] for DAPO and other methods, as it introduces approximately three times the sampling cost at each training step. All methods are run for 1,400 steps on Q1.5B models and 1,000 steps on Q7B/L8B models. Evaluations are conducted every 200 steps, and the best performance for each method is reported. |
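
The values quoted in the Experiment Setup row can be collected into a small configuration sketch, which also makes the stated arithmetic (β × δ = 0.1, evaluation every 200 steps) easy to check. This is an illustration only: the class `DiscoSetup`, its field names, and the helper methods are hypothetical and are not taken from the DisCO code base. It also assumes that the batch size of 128 counts questions and that the first evaluation happens at step 200, neither of which the quoted text states explicitly.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscoSetup:
    """Hypothetical container for the hyperparameters quoted in the table."""
    model: str
    learning_rate: float          # the value reported to work better "generally"
    total_steps: int
    weight_decay: float = 0.01    # AdamW weight decay
    batch_size: int = 128         # assumed to count questions per training step
    mini_batch_size: int = 32
    responses_per_question: int = 8
    temperature: float = 0.6      # used for both training and evaluation
    delta: float = 1e-4           # KL constraint threshold
    beta: float = 1e3             # penalty weight on constraint violation
    eval_interval: int = 200

    def penalty_scale(self) -> float:
        # Effective KL weight when the constraint is violated by delta.
        return self.beta * self.delta

    def responses_per_step(self) -> int:
        # Sampled responses per step, under the batch-size assumption above.
        return self.batch_size * self.responses_per_question

    def eval_steps(self) -> list[int]:
        # Assumes evaluations at 200, 400, ..., up to and including total_steps.
        return list(range(self.eval_interval, self.total_steps + 1, self.eval_interval))


SETUPS = {
    "Q1.5B": DiscoSetup(model="Q1.5B", learning_rate=2e-6, total_steps=1400),
    "Q7B": DiscoSetup(model="Q7B", learning_rate=1e-6, total_steps=1000),
    "L8B": DiscoSetup(model="L8B", learning_rate=5e-7, total_steps=1000),
}

if __name__ == "__main__":
    for name, cfg in SETUPS.items():
        assert math.isclose(cfg.penalty_scale(), 0.1)
        print(name, cfg.learning_rate, cfg.responses_per_step(), cfg.eval_steps())
```

Under these assumptions, a Q1.5B run has seven evaluation points and a Q7B or L8B run has five, and the reported score for each method is the best of those.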