Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Authors: Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Lei Han, Haitao Mi, Dong Yu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results in mathematical reasoning tasks demonstrate that ALPHALLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.
Researcher Affiliation | Industry | Tencent AI Lab, Bellevue, WA; Tencent Robotics X
Pseudocode | Yes | Algorithm 1 (LLM self-improving loop) in Appendix A.1; a hedged sketch of this loop appears after the table.
Open Source Code | Yes | The code is available at https://github.com/YeTianJHU/AlphaLLM.
Open Datasets | Yes | We choose to evaluate on two widely used datasets GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021).
Dataset Splits | Yes | We perform early stopping based on a devset held out from the training instances.
Hardware Specification | Yes | Our experiments were conducted using NVIDIA A100 40GB GPUs.
Software Dependencies | No | The paper mentions tools like 'python sympy' but does not specify version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | We select Llama-2-70b as the policy model for the GSM8K dataset and WizardMath-70B-V1.0 for the MATH dataset. [...] The training employs a learning rate of 1e-6 and runs for one epoch. [...] For policy self-improving (Section 4.5), we train the policy model for up to 3 epochs, setting the batch size to 128, the learning rate to 5e-6, and the minimal learning rate to 1e-6.
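
The following is a minimal, self-contained Python sketch of the self-improving loop that Algorithm 1 describes (critic-guided search to collect trajectories, filtering, then policy fine-tuning). The function names (search_trajectory, finetune, self_improve) and the toy policy/critic are hypothetical placeholders, not the authors' implementation; only the fine-tuning hyperparameters (up to 3 epochs, batch size 128, learning rate 5e-6 with a 1e-6 minimum) come from the experiment setup quoted above.

import random

def search_trajectory(policy, critic, prompt, num_candidates=4):
    # Stand-in for the critic-guided MCTS search: sample several candidate
    # trajectories from the policy and keep the one the critic scores highest.
    candidates = [policy(prompt) for _ in range(num_candidates)]
    return max(candidates, key=critic)

def finetune(policy, trajectories, epochs, batch_size, lr, min_lr):
    # Placeholder: a real run would perform supervised fine-tuning of the LLM
    # on the selected trajectories with the given hyperparameters.
    return policy

def self_improve(policy, critic, prompts, rounds=1):
    for _ in range(rounds):
        # Imagination + searching: collect critic-guided trajectories.
        trajectories = [search_trajectory(policy, critic, p) for p in prompts]
        # Criticizing: keep only trajectories the critic rates highly.
        selected = [t for t in trajectories if critic(t) > 0.5]
        # Policy self-improving: fine-tune on the selected data
        # (hyperparameters as reported in the experiment setup above).
        policy = finetune(policy, selected, epochs=3, batch_size=128,
                          lr=5e-6, min_lr=1e-6)
    return policy

if __name__ == "__main__":
    # Toy stand-ins so the sketch executes end to end.
    toy_policy = lambda prompt: f"{prompt} -> candidate answer {random.randint(0, 9)}"
    toy_critic = lambda trajectory: random.random()
    self_improve(toy_policy, toy_critic, ["2 + 2 = ?"], rounds=1)

In a real run, each round of this loop would replace the toy stand-ins with the actual policy model, the critic models, and the search procedure, so that the policy is repeatedly fine-tuned on trajectories it could not have produced by greedy decoding alone.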