Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Self-Improvement in Language Models: The Sharpening Mechanism
Authors: Audrey Huang, Adam Block, Dylan Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan Ash, Akshay Krishnamurthy
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we empirically validate the sharpening mechanism via inference-time and amortization experiments. We view these findings as a starting point toward a foundational understanding that can guide the design and evaluation of self-improvement algorithms. [...] Empirical investigation (Appendix A). We explore empirically the extent to which our theoretical framework and methods improve language model performance in a variety of tasks. |
| Researcher Affiliation | Collaboration | Audrey Huang (UIUC), Adam Block (Microsoft Research), Dylan J. Foster (Microsoft Research), Dhruv Rohatgi (MIT), Cyril Zhang (Microsoft Research), Max Simchowitz (CMU), Jordan T. Ash (Microsoft Research), Akshay Krishnamurthy (Microsoft Research) |
| Pseudocode | Yes | Algorithm 1: Reward-based variant of Exploratory Preference Optimization (Xie et al., 2024). Input: base model πbase : X → Δ(Y), reward function r : X × Y → ℝ, number of iterations T ∈ ℕ, KL regularization coefficient β > 0, optimism coefficient α > 0. Initialize: π^(1) ← πbase, D^(0) ← ∅. For iteration t = 1, …, T: generate sample (x^(t), y^(t), ỹ^(t)) via x^(t) ∼ μ, y^(t) ∼ π^(t)(· ∣ x^(t)), ỹ^(t) ∼ πbase(· ∣ x^(t)); update dataset: D^(t) ← D^(t−1) ∪ {(x^(t), y^(t), ỹ^(t))}; model optimization with global optimism: π^(t+1) ← argmin over π ∈ Π of −α Σ_{(x,y,ỹ)∈D^(t)} log π(y ∣ x) + Σ_{(x,y,ỹ)∈D^(t)} (β log [π(y ∣ x)/πbase(y ∣ x)] − β log [π(ỹ ∣ x)/πbase(ỹ ∣ x)] − (r(x, y) − r(x, ỹ)))². Return: π̂ ← argmax over t ∈ [T+1] of Jβ(π^(t)). Can estimate Jβ(π^(t)) using validation data. |
| Open Source Code | No | The paper lists models used (e.g., Phi3-Mini, Llama3.2-3B, Mistral-7B) and states: 'All models, except for gpt-3.5-turbo-instruct, are available on https://huggingface.co and we provide Hugging Face model identifiers below.' However, this refers to third-party models used in their experiments, not the authors' own implementation code for the methodology described in the paper. There is no explicit statement about releasing their own code or a direct link to a code repository. |
| Open Datasets | Yes | MATH: We use the above models to generate responses to prompts from the MATH (Hendrycks et al., 2021) [...] GSM8k: We use the above models to generate responses to prompts from the GSM-8k dataset (Cobbe et al., 2021) [...] Pronto QA: We use the above models to generate responses to prompts from the Pronto QA dataset (Saparov & He, 2023) [...] MMLU: We use the above models to generate responses to prompts from three subsets of the MMLU dataset (Hendrycks et al., 2020) [...] Game Of24: We use only the model of Wan et al. (2024) (i.e., llama2-7b-game24-policy-hf), on the Game Of24 task (Yao et al., 2024). |
| Dataset Splits | Yes | MATH: We consider all subsets and take the first 256 examples of the test set where the solution matches the regular expression (\d*). [...] GSM8k: We take the first 256 examples from the test set in the main subset. [...] Pronto QA: We take the first 256 examples from the training set. [...] MMLU: We take the first 256 examples of the test set. [...] Game Of24: Here we use both the train and test splits of the dataset. |
| Hardware Specification | Yes | All of our experiments were run either on 40G NVIDIA A100 GPUs, 192G AMD MI300X GPUs, or through the OpenAI API. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers, such as Python versions or specific library versions (e.g., PyTorch, TensorFlow, Hugging Face Transformers). |
| Experiment Setup | Yes | For all models and datasets except for Game Of24, we used 1-shot prompting... We set the maximum length of decoding to be 512 tokens. We used 10 seeds for all (model, task) pairs with a maximum value of N = 50 in Best-of-N sampling. For Best-of-N sampling, we always use temperature 1.0. ... We report the specific hyperparameters chosen in Table 2. On all models, we used a learning rate of 3 × 10−4 with linear decay to zero and gradient clamping at 0.1. In all experiments involving Phi3.5-Mini we use a batch size of 4; unfortunately, due to a known numerical issue with LoRA on Mistral-7B-Instruct-v0.3 involving batch size > 1, we use a batch of 1 in this case. |
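The optimization step quoted under Pseudocode combines a negative log-likelihood term weighted by the optimism coefficient α with a squared penalty matching the β-scaled log-ratio difference to the true reward gap. Since the authors' implementation is not released (see Open Source Code), the per-example objective can only be sketched; the function name and argument layout below are hypothetical, with toy log-probabilities standing in for model outputs:

```python
def xpo_example_loss(logp_pi_y, logp_base_y, logp_pi_yt, logp_base_yt,
                     r_y, r_yt, alpha, beta):
    """Per-example objective of Algorithm 1 (reward-based XPO variant).

    y is the sample from the current policy, yt (y-tilde) the sample
    from the base policy. The loss is
      -alpha * log pi(y|x)                     (global optimism term)
      + (beta * [log-ratio(y) - log-ratio(yt)]
         - (r(x, y) - r(x, yt)))**2            (reward-matching term)
    """
    nll = -alpha * logp_pi_y
    implicit = beta * (logp_pi_y - logp_base_y) \
             - beta * (logp_pi_yt - logp_base_yt)
    target = r_y - r_yt
    return nll + (implicit - target) ** 2
```

In practice this would be summed over the dataset D^(t) and minimized over policies π ∈ Π; the sketch only shows the scalar form of one term.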
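The Best-of-N protocol in the experiment setup (N up to 50, temperature 1.0) is the paper's inference-time instantiation of sharpening: sample N responses and keep the one the model itself scores highest. A minimal sketch, assuming hypothetical `generate` and `self_reward` callables in place of the actual sampler and self-reward (e.g., sequence log-likelihood) scorer:

```python
import random

def best_of_n(generate, self_reward, prompt, n=50, seed=0):
    """Draw n candidate responses for a prompt and return the one the
    model's own scorer ranks highest (inference-time sharpening)."""
    rng = random.Random(seed)  # one seed per run, as in the 10-seed protocol
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda y: self_reward(prompt, y))
```

With `self_reward` set to the model's own log-likelihood of the response, this selects an approximate mode of the model's conditional distribution, which is the sharpening behavior the paper studies.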