Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Authors: Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Lei Han, Haitao Mi, Dong Yu
Venue: NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results in mathematical reasoning tasks demonstrate that ALPHALLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs. |
| Researcher Affiliation | Industry | ¹Tencent AI Lab, Bellevue, WA; ²Tencent Robotics X |
| Pseudocode | Yes | Algorithm 1: LLM self-improving loop in Appendix A.1. |
| Open Source Code | Yes | The code is available at https://github.com/YeTianJHU/AlphaLLM. |
| Open Datasets | Yes | We choose to evaluate on two widely used datasets GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). |
| Dataset Splits | Yes | We perform early stopping based on a devset held out from the training instances. |
| Hardware Specification | Yes | Our experiments were conducted using NVIDIA A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions tools like 'python sympy' but does not specify version numbers for any software dependencies, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We select Llama-2-70b as the policy model for the GSM8K dataset and WizardMath-70B-V1.0 for the MATH dataset. [...] The training employs a learning rate of 1e-6 and runs for one epoch. [...] For policy self-improving (§4.5), we train the policy model up to 3 epochs, setting batch size to 128, learning rate to 5e-6 and minimal learning rate to 1e-6. |
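
The "Pseudocode" row points to Algorithm 1, the LLM self-improving loop (imagine new questions, search for solutions with critic feedback, then fine-tune the policy on the resulting trajectories). The sketch below is a minimal illustration of that loop, not the authors' code: `imagine_questions`, `search_with_critic`, `finetune`, the candidate count, and the reward threshold are all hypothetical stand-ins.

```python
# Minimal sketch of an imagine-search-criticize self-improving loop.
# All helpers are hypothetical stand-ins, not the paper's implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    question: str
    solution: str
    reward: float


def imagine_questions(seed_questions: List[str], n: int) -> List[str]:
    """Hypothetical 'imagination' step: synthesize new questions from seeds."""
    return [f"{q} (variant {i})" for i, q in enumerate(seed_questions * n)][:n]


def search_with_critic(question: str,
                       policy: Callable[[str], str],
                       critic: Callable[[str, str], float]) -> Trajectory:
    """Hypothetical search step: sample candidate solutions from the policy and
    keep the one the critic scores highest (the paper uses MCTS over steps)."""
    candidates = [policy(question) for _ in range(4)]
    best = max(candidates, key=lambda s: critic(question, s))
    return Trajectory(question, best, critic(question, best))


def finetune(policy: Callable[[str], str],
             data: List[Trajectory]) -> Callable[[str], str]:
    """Hypothetical policy update: a real run would fine-tune the LLM here."""
    return policy


def self_improve(policy, critic, seed_questions, iterations=2):
    """Repeat imagination, search, and fine-tuning for a few rounds."""
    for _ in range(iterations):
        questions = imagine_questions(seed_questions, n=8)
        trajectories = [search_with_critic(q, policy, critic) for q in questions]
        good = [t for t in trajectories if t.reward > 0.5]  # keep high-reward data
        policy = finetune(policy, good)
    return policy
```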
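
The "Experiment Setup" row quotes the fine-tuning hyperparameters. The snippet below shows one way those numbers could be wired into a PyTorch optimizer and scheduler; AdamW and the cosine decay down to the minimal learning rate are assumptions, since the paper only states the rates, batch size, and epoch count.

```python
# Sketch of the quoted policy self-improving hyperparameters: lr 5e-6 decayed
# to a minimum of 1e-6 over up to 3 epochs with batch size 128. The optimizer
# and schedule choices are assumptions, not taken from the paper.
import torch

model = torch.nn.Linear(16, 16)           # stand-in for the policy LLM
batch_size, epochs = 128, 3
steps_per_epoch = 100                     # illustrative step count
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=steps_per_epoch * epochs, eta_min=1e-6)
```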