Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Authors: Christian Walder, Deep Tejas Karkhanis
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our transformations on illustrative toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source models GEMMA2 and LLAMA3.1. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k. |
| Researcher Affiliation | Industry | Christian Walder & Deep Karkhanis, Google DeepMind, cwalder/EMAIL |
| Pseudocode | Yes | Listing 1: Python reward batch transformations. Functions with names that begin with an underscore are helpers, while the remaining four functions `rho`, `s`, `sloo` and `sloo_minus_one` implement $\rho(g)$, $s_i$, $s_i^{(\text{loo})}$ and $s_i^{(\text{loo}-1)}$, respectively. For simplicity this implementation costs $O(nk + n \log n)$; reducing this to $O(k + n \log n)$ would require optimizing `_deltas` and `_m_diagonal`. |
| Open Source Code | Yes | Listing 1: Python reward batch transformations. Functions with names that begin with an underscore are helpers, while the remaining four functions `rho`, `s`, `sloo` and `sloo_minus_one` implement $\rho(g)$, $s_i$, $s_i^{(\text{loo})}$ and $s_i^{(\text{loo}-1)}$, respectively. For simplicity this implementation costs $O(nk + n \log n)$; reducing this to $O(k + n \log n)$ would require optimizing `_deltas` and `_m_diagonal`. |
| Open Datasets | Yes | We demonstrate promising RL results with the 2B and 9B parameter variants of GEMMA2 [TRP+24] and the 8B parameter variant of LLAMA3.1 on real-world problems in MATH [HBK+21], code generation [AON+21] [CTJ+21b], and the easy public subset of ARC-AGI-1 [CKKL25]. |
| Dataset Splits | Yes | We use the training split of Hendrycks MATH [HBK+21] which contains 12,000 problems as our task set. ... We make an 80:20 train:test split of the same easy subset as before and report the cumulative solve rate on the train set and pass@k rate on the test set. |
| Hardware Specification | Yes | For GEMMA2-2B we use a v5litepod-128 [Goo] which needs around 4 hours per 1000 training steps. |
| Software Dependencies | No | The paper provides a Python code listing but does not specify version numbers for Python itself or any other software libraries or frameworks used for the experiments. |
| Experiment Setup | Yes | For our experiments, we set $n = 16$. ... We repeat the training for a selection of $k_{\text{opt}}$, thus optimizing a different pass@$k_{\text{opt}}$ each time. ... small sweep over the values 0.001, 0.005, 0.01, 0.05, 0.1 for the entropy coefficient for each (model, benchmark) pair and only report the best result as Entropy Reg. ... We show a simple annealing procedure which starts training with a high $k_{\text{opt}} = 8$ and reduces it to $k_{\text{opt}} = 1$ after 1500 steps. |
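
To make the quantities in the Experiment Setup row concrete, the following is a minimal sketch and not the paper's Listing 1. It assumes binary rewards and uses the standard combinatorial pass@k estimate computed from n samples, together with a step-wise schedule matching the quoted annealing description. The names `pass_at_k` and `k_opt_schedule` are hypothetical, and the exact switch boundary (step 1500 itself using the lower value) is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct (binary rewards).

    Standard combinatorial estimate: one minus the probability that a
    uniformly random size-k subset of the n samples contains no correct one.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def k_opt_schedule(step: int, k_start: int = 8, k_end: int = 1,
                   switch_step: int = 1500) -> int:
    """Hypothetical step-wise annealing: k_start before switch_step, k_end after.

    Defaults mirror the quoted setup (k_opt = 8 reduced to 1 after 1500 steps);
    treating step 1500 itself as already annealed is an assumption.
    """
    return k_start if step < switch_step else k_end
```

With n = 16 as in the quoted setup, a single correct sample gives `pass_at_k(16, 1, 8) == 0.5`, whereas `pass_at_k(16, 1, 1) == 0.0625`, which illustrates why a larger k rewards rare successes on hard problems more strongly.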