Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Self-Improvement in Language Models: The Sharpening Mechanism
Authors: Audrey Huang, Adam Block, Dylan Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan Ash, Akshay Krishnamurthy
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we empirically validate the sharpening mechanism via inference-time and amortization experiments. We view these findings as a starting point toward a foundational understanding that can guide the design and evaluation of self-improvement algorithms. [...] Empirical investigation (Appendix A). We explore empirically the extent to which our theoretical framework and methods improve language model performance in a variety of tasks. |
| Researcher Affiliation | Collaboration | Audrey Huang (UIUC), Adam Block (Microsoft Research), Dylan J. Foster (Microsoft Research), Dhruv Rohatgi (MIT), Cyril Zhang (Microsoft Research), Max Simchowitz (CMU), Jordan T. Ash (Microsoft Research), Akshay Krishnamurthy (Microsoft Research) |
| Pseudocode | Yes | Algorithm 1: Reward-based variant of Exploratory Preference Optimization (Xie et al., 2024). Input: base model πbase : X → Δ(Y), reward function r : X × Y → ℝ, number of iterations T ∈ ℕ, KL regularization coefficient β > 0, optimism coefficient α > 0. Initialize: π^(1) ← πbase, D^(0) ← ∅. For iteration t = 1, …, T: generate sample (x^(t), y^(t), ỹ^(t)) via x^(t) ∼ μ, y^(t) ∼ π^(t)(· ∣ x^(t)), ỹ^(t) ∼ πbase(· ∣ x^(t)); update dataset: D^(t) ← D^(t−1) ∪ {(x^(t), y^(t), ỹ^(t))}; model optimization with global optimism: π^(t+1) ← argmin over π ∈ Π of −α Σ_{(x,y,ỹ)∈D^(t)} log π(y ∣ x) + Σ_{(x,y,ỹ)∈D^(t)} (β log [π(y ∣ x)/πbase(y ∣ x)] − β log [π(ỹ ∣ x)/πbase(ỹ ∣ x)] − (r(x, y) − r(x, ỹ)))². Return: π̂ ← argmax over t ∈ [T+1] of Jβ(π^(t)). Can estimate Jβ(π^(t)) using validation data. |
| Open Source Code | No | The paper lists models used (e.g., Phi3-Mini, Llama3.2-3B, Mistral-7B) and states: 'All models, except for gpt-3.5-turbo-instruct, are available on https://huggingface.co and we provide Hugging Face model identifiers below.' However, this refers to third-party models used in their experiments, not the authors' own implementation code for the methodology described in the paper. There is no explicit statement about releasing their own code or a direct link to a code repository. |
| Open Datasets | Yes | MATH: We use the above models to generate responses to prompts from the MATH (Hendrycks et al., 2021) [...] GSM8k: We use the above models to generate responses to prompts from the GSM-8k dataset (Cobbe et al., 2021) [...] Pronto QA: We use the above models to generate responses to prompts from the Pronto QA dataset (Saparov & He, 2023) [...] MMLU: We use the above models to generate responses to prompts from three subsets of the MMLU dataset (Hendrycks et al., 2020) [...] Game Of24: We use only the model of Wan et al. (2024) (i.e., llama2-7b-game24-policy-hf), on the Game Of24 task (Yao et al., 2024). |
| Dataset Splits | Yes | MATH: We consider all subsets and take the first 256 examples of the test set where the solution matches the regular expression (\d*). [...] GSM8k: We take the first 256 examples from the test set in the main subset. [...] Pronto QA: We take the first 256 examples from the training set. [...] MMLU: We take the first 256 examples of the test set. [...] Game Of24: Here we use both the train and test splits of the dataset. |
| Hardware Specification | Yes | All of our experiments were run either on 40G NVIDIA A100 GPUs, 192G AMD MI300X GPUs, or through the OpenAI API. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers, such as Python versions or specific library versions (e.g., PyTorch, TensorFlow, Hugging Face Transformers). |
| Experiment Setup | Yes | For all models and datasets except for Game Of24, we used 1-shot prompting... We set the maximum length of decoding to be 512 tokens. We used 10 seeds for all (model, task) pairs with a maximum value of N = 50 in Best-of-N sampling. For Best-of-N sampling, we always use temperature 1.0. ... We report the specific hyperparameters chosen in Table 2. On all models, we used a learning rate of 3 × 10−4 with linear decay to zero and gradient clamping at 0.1. In all experiments involving Phi3.5-Mini we use a batch size of 4; unfortunately, due to a known numerical issue with LoRA on Mistral-7B-Instruct-v0.3 involving batch size > 1, we use a batch of 1 in this case. |
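The optimization step quoted under Pseudocode combines a negative log-likelihood term weighted by the optimism coefficient α with a squared penalty matching the β-scaled log-ratio difference to the true reward gap. Since the authors' implementation is not released (see Open Source Code), the per-example objective can only be sketched; the function name and argument layout below are hypothetical, with toy log-probabilities standing in for model outputs:

```python
def xpo_example_loss(logp_pi_y, logp_base_y, logp_pi_yt, logp_base_yt,
                     r_y, r_yt, alpha, beta):
    """Per-example objective of Algorithm 1 (reward-based XPO variant).

    y is the sample from the current policy, yt (y-tilde) the sample
    from the base policy. The loss is
      -alpha * log pi(y|x)                     (global optimism term)
      + (beta * [log-ratio(y) - log-ratio(yt)]
         - (r(x, y) - r(x, yt)))**2            (reward-matching term)
    """
    nll = -alpha * logp_pi_y
    implicit = beta * (logp_pi_y - logp_base_y) \
             - beta * (logp_pi_yt - logp_base_yt)
    target = r_y - r_yt
    return nll + (implicit - target) ** 2
```

In practice this would be summed over the dataset D^(t) and minimized over policies π ∈ Π; the sketch only shows the scalar form of one term.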
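The Best-of-N protocol in the experiment setup (N up to 50, temperature 1.0) is the paper's inference-time instantiation of sharpening: sample N responses and keep the one the model itself scores highest. A minimal sketch, assuming hypothetical `generate` and `self_reward` callables in place of the actual sampler and self-reward (e.g., sequence log-likelihood) scorer:

```python
import random

def best_of_n(generate, self_reward, prompt, n=50, seed=0):
    """Draw n candidate responses for a prompt and return the one the
    model's own scorer ranks highest (inference-time sharpening)."""
    rng = random.Random(seed)  # one seed per run, as in the 10-seed protocol
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda y: self_reward(prompt, y))
```

With `self_reward` set to the model's own log-likelihood of the response, this selects an approximate mode of the model's conditional distribution, which is the sharpening behavior the paper studies.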