Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Authors: Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Lei Han, Haitao Mi, Dong Yu
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results in mathematical reasoning tasks demonstrate that ALPHALLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs. |
| Researcher Affiliation | Industry | ¹Tencent AI Lab, Bellevue, WA; ²Tencent Robotics X |
| Pseudocode | Yes | Algorithm 1: LLM self-improving loop in Appendix A.1. |
| Open Source Code | Yes | The code is available at https://github.com/YeTianJHU/AlphaLLM. |
| Open Datasets | Yes | We choose to evaluate on two widely used datasets GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). |
| Dataset Splits | Yes | We perform early stopping based on a devset held out from the training instances. |
| Hardware Specification | Yes | Our experiments were conducted using NVIDIA A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions tools like 'python sympy' but does not specify version numbers for any software dependencies, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We select Llama-2-70b as the policy model for the GSM8K dataset and WizardMath-70B-V1.0 for the MATH dataset. [...] The training employ a learning rate of 1e-6 and are trained for one epoch. [...] For policy self-improving (§4.5), we train the policy model up to 3 epochs, setting batch size to 128, learning rate to 5 × 10⁻⁶ and minimal learning rate to 1 × 10⁻⁶. |
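
The policy self-improving hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The quoted text gives only the peak and minimal learning rates, not the schedule that connects them, so the cosine decay below is an assumption made for illustration. The names `PolicyTrainingConfig` and `cosine_lr` are hypothetical and do not come from the paper or its repository.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyTrainingConfig:
    """Values quoted in the Experiment Setup row for policy self-improving."""
    max_epochs: int = 3               # "up to 3 epochs"
    batch_size: int = 128
    learning_rate: float = 5e-6       # peak learning rate
    min_learning_rate: float = 1e-6   # minimal learning rate


def cosine_lr(step: int, total_steps: int, cfg: PolicyTrainingConfig) -> float:
    """Decay from the peak to the minimal learning rate.

    ASSUMPTION: the schedule shape (cosine) is not stated in the quoted
    text; only the two endpoint values are.
    """
    # Clamp progress to [0, 1] so out-of-range steps stay within the endpoints.
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    weight = 0.5 * (1.0 + math.cos(math.pi * progress))
    return cfg.min_learning_rate + (cfg.learning_rate - cfg.min_learning_rate) * weight


if __name__ == "__main__":
    cfg = PolicyTrainingConfig()
    for step in (0, 50, 100):
        print(step, cosine_lr(step, 100, cfg))
```

Under this assumed schedule the rate starts at the peak value and ends at the minimal value; any other monotone decay between the same two endpoints would be equally consistent with the quoted setup.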