Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

LILO: Learning to Reason at the Frontier of Learnability

Authors: Thomas Foster, Anya Sims, Johannes Forkel, Jakob Foerster

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We run a wide range of experiments over multiple base models, algorithms and reasoning datasets to demonstrate that LILO consistently reaches a higher final test accuracy, and can do so in 3× fewer training steps. Results: Section 6 presents the first results that prioritising learnability during the RL training of LLMs improves training speed by 3× whilst boosting final performance.
Researcher Affiliation	Collaboration	The paper lists authors without explicit affiliations on the first page, but references throughout the paper point to both academic institutions (e.g., "University of Wisconsin-Madison Department of Computer Sciences" in [49]) and industry affiliations (e.g., "OpenAI" in [3], "Google" in [38], "Meta AI" in [32]). Given that the work cites research from both types of organizations and Jakob Foerster is often associated with academic research at institutions, it is most likely a collaboration. However, without explicit affiliations for the authors of this paper, it is hard to definitively classify. Based on the NeurIPS checklist, the paper was submitted to NeurIPS 2025, which is an academic conference. Jakob Foerster is a known academic (e.g., Oxford, DeepMind, University of Cambridge). Given the context of the conference and common affiliations of such researchers, a collaborative environment involving academia and potentially industry (e.g., DeepMind is Google-owned) is highly probable for a paper of this nature.
Pseudocode	Yes	Thus, in Algorithm 1 we present a method that, at every training step 1) produces a batch of questions with high learnability and 2) trains on this batch. In Algorithm 2, we introduce a simple method to produce a batch of questions with high learnability, based on rejection sampling. ... we present Algorithm 3 which is smarter in how it samples.
Open Source Code	No	As in the previous checklist item, the results in Section 6 were produced using existing open-source codebases, models and data [5] [4]. This is described in Section 5. We will open-source the small modifications we made to these repos upon acceptance.
Open Datasets	Yes	We evaluate LILO across three RL algorithms (GRPO [6], PPO [7] and Vine PPO [5]), three training datasets of varying size and difficulty (the often standard GSM8K [8] dataset, the more challenging MATH [9] dataset, and the larger and more diverse ORZ57K [10]) and two base models (Rho-1B [11] and Qwen-2.5-1.5B [12]). We further evaluate downstream performance of the MATH-trained models on College MATH [13] (2,818 college-level questions) and Olympiad Bench [14] (8,000 Olympiad level maths and physics competitions).
Dataset Splits	No	Datasets: For Vine PPO and PPO experiments, we train on mathematical reasoning datasets MATH [9] (12,000 competition-level problems) and GSM8K [8] (8,000 simpler grade school problems). ... Metrics: We evaluate model performance on the test sets of each dataset, using accuracy (Pass@1) as the primary metric. While the paper mentions training and testing on datasets, it does not explicitly provide the split ratios or counts for how these datasets were divided into training, validation, and test sets. It only mentions the total number of problems for MATH (12,000) and GSM8K (8,000) and that test sets were used.
Hardware Specification	Yes	The PPO and Vine PPO experiments took 1 week on 4x L40s GPUs for each algorithm, dataset combination. The GRPO experiments took 1 day on 8x H200 GPUs for each of GRPO with LILO and GRPO without LILO.
Software Dependencies	No	RL algorithms: We add LILO to two existing open-source libraries for training LLMs to reason with RL. The Vine PPO library [5] provides implementations of PPO and Vine PPO. The OAT library [4] implements GRPO tuned to closely replicate the Deepseek R1 results. The paper mentions these libraries (Vine PPO and OAT) but does not specify their version numbers or other software dependencies with versions.
Experiment Setup	Yes	Hyperparameters: The main hyperparameters for rejection sampling in Algorithm 2 are the size of the candidate pool |D| and the value of N_learnability, the number of responses sampled per question. Ideally, these would be tuned to be as small as possible whilst achieving batches with high learnability. In practice, choosing |D| = 4 |B| and N_learnability = 8 works well with minimal overhead. The only exception is for Vine PPO on GSM8K, where the model nearing 95% train accuracy requires |D| = 8 |B| to produce high learnability batches. All of the hyperparameters for PPO, Vine PPO and GRPO we leave unchanged from their implementations in [5] and [4].
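
The batch selection step quoted in the Pseudocode and Experiment Setup rows can be illustrated with a minimal sketch. This is not the authors' Algorithm 2. It assumes that learnability is scored as p(1 - p), where p is the empirical success rate over N_learnability sampled responses, and that the batch is formed by keeping the highest-scoring questions from a candidate pool of size |D| = 4 |B|. The names `learnability`, `select_learnable_batch` and `attempt` are hypothetical, and `attempt` stands in for sampling one model response and checking its answer.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def learnability(success_rate: float) -> float:
    """Assumed score p * (1 - p); it is largest when p = 0.5."""
    return success_rate * (1.0 - success_rate)


def select_learnable_batch(
    questions: Sequence[str],
    attempt: Callable[[str], bool],  # hypothetical: one sampled response, True if correct
    batch_size: int,
    pool_multiplier: int = 4,        # |D| = 4 |B|, as quoted above
    n_learnability: int = 8,         # responses sampled per question, as quoted above
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, float]]:
    """Score a random candidate pool and keep the most learnable questions."""
    rng = rng or random.Random(0)
    pool_size = min(len(questions), pool_multiplier * batch_size)
    pool = rng.sample(list(questions), pool_size)

    scored = []
    for question in pool:
        # Estimate the success rate from n_learnability independent attempts.
        successes = sum(attempt(question) for _ in range(n_learnability))
        scored.append((question, learnability(successes / n_learnability)))

    # Assumption: keep the top-scoring questions; the paper may select differently.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:batch_size]
```

Under this assumed score, questions the model always solves or never solves receive a learnability of zero and are passed over in favour of questions it solves only some of the time. That is consistent with the quoted remark that a model nearing 95% train accuracy needs a larger pool to fill a high-learnability batch.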