reproducibilityindex.ai

Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Authors: Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Lei Han, Haitao Mi, Dong Yu

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experimental results in mathematical reasoning tasks demonstrate that ALPHALLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.
Researcher Affiliation	Industry	¹Tencent AI Lab, Bellevue, WA; ²Tencent Robotics X
Pseudocode	Yes	Algorithm 1: LLM self-improving loop in Appendix A.1.
Open Source Code	Yes	The code is available at https://github.com/YeTianJHU/AlphaLLM.
Open Datasets	Yes	We choose to evaluate on two widely used datasets GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021).
Dataset Splits	Yes	We perform early stopping based on a devset held out from the training instances.
Hardware Specification	Yes	Our experiments were conducted using NVIDIA A100 40GB GPUs.
Software Dependencies	No	The paper mentions tools like 'python sympy' but does not specify version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup	Yes	We select Llama-2-70b as the policy model for the GSM8K dataset and WizardMath-70B-V1.0 for the MATH dataset. [...] The training employ a learning rate of 1e-6 and are trained for one epoch. [...] For policy self-improving (§4.5), we train the policy model up to 3 epochs, setting batch size to 128, learning rate to 5 × 10⁻⁶ and minimal learning rate to 1 × 10⁻⁶.
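
The policy self-improving hyperparameters quoted in the Experiment Setup row can be collected into a small config object, as in the minimal sketch below. The class name `PolicySelfImprovementConfig` and the helper `lr_at_step` are hypothetical names introduced here for illustration. The cosine decay shape is also an assumption: the quoted text gives only a peak and a minimal learning rate, not the schedule that connects them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicySelfImprovementConfig:
    """Hypothetical container for the values quoted in the Experiment Setup row."""
    max_epochs: int = 3               # "up to 3 epochs"
    batch_size: int = 128             # "batch size to 128"
    learning_rate: float = 5e-6       # peak learning rate
    min_learning_rate: float = 1e-6   # "minimal learning rate"


def lr_at_step(cfg: PolicySelfImprovementConfig, step: int, total_steps: int) -> float:
    """Learning rate at `step`, decaying from the peak to the minimal rate.

    ASSUMPTION: cosine decay. The source does not name the schedule.
    """
    if total_steps <= 1:
        return cfg.learning_rate
    # Clamp the step into [0, total_steps - 1] and map it to [0, 1].
    progress = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return cfg.min_learning_rate + (cfg.learning_rate - cfg.min_learning_rate) * cosine


if __name__ == "__main__":
    cfg = PolicySelfImprovementConfig()
    for step in (0, 50, 99):
        print(step, lr_at_step(cfg, step, total_steps=100))
```

Under this assumed schedule the rate starts at the quoted peak and ends at the quoted minimum. Any other monotone decay, such as linear, would satisfy the same two endpoints, so the released code should be consulted for the schedule actually used.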