Recursive Introspection: Teaching Language Model Agents How to Self-Improve

Authors: Yuxiao Qu, Tianjun Zhang, Naman Garg, Aviral Kumar

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that RISE enables Llama2, Llama3, and Mistral models to improve themselves with more turns on reasoning tasks, outperforming several single-turn strategies given an equal amount of inference-time computation.
Researcher Affiliation | Collaboration | Yuxiao Qu (1), Tianjun Zhang (2), Naman Garg (3), Aviral Kumar (1); (1) Carnegie Mellon University, (2) UC Berkeley, (3) MultiOn
Pseudocode | Yes | A complete algorithmic pseudocode for each approach is shown in Appendix D. ... Algorithm 1 Data Collection at Iteration T ... Algorithm 2 Inference at iteration T
Open Source Code | Yes | The code is publicly available at https://github.com/cmu-mind/RISE
Open Datasets | Yes | Specifically, on the GSM8K [12] dataset... We see similar trends on the MATH dataset [20]... The GSM8K dataset consists of 7,473 problems in the training portion and 1,319 problems in the testing portion. Similarly, the MATH dataset is divided into 7,500 problems for training and 1,000 problems for testing.
Dataset Splits | No | The GSM8K dataset consists of 7,473 problems in the training portion and 1,319 problems in the testing portion. Similarly, the MATH dataset is divided into 7,500 problems for training and 1,000 problems for testing. The training portions of both datasets are used to generate trajectories in each iteration of the RISE method, while the testing portions are held out for evaluating the performance of the models.
Hardware Specification | Yes | The hyperparameters used for finetuning are specified in Table 9. ... gpus: 4x A40
Software Dependencies | No | For finetuning, we utilize the FastChat codebase, but we customize the loss function to be weighted by reward. The base models are directly loaded from Hugging Face: Llama-2-7b-chat-hf (https://huggingface.co/meta-llama/Llama-2-7b-hf) and Mistral-7B-Instruct-v0.2. The hyperparameters used for finetuning are specified in Table 9.
Experiment Setup | Yes | The hyperparameters used for finetuning are specified in Table 9: bf16: True; epochs: 2; per-device train batch size: 1; GPUs: 4x A40; gradient accumulation steps: 16; learning rate: 1e-5; weight decay: 0; warmup ratio: 0.04; learning rate scheduler type: cosine; tf32: True; model max length: 2048.
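
For readers who want a concrete picture of the multi-turn procedure referenced in the Pseudocode row (Algorithm 2, inference), the following is a minimal Python sketch, not the authors' released implementation from the repository above. The `generate` and `is_correct` callables and the retry-prompt wording are hypothetical placeholders; the external answer check stands in for the verifier/oracle variant of stopping.

```python
# Minimal sketch of multi-turn self-improvement inference in the spirit of
# RISE's Algorithm 2 (not the authors' released code). `generate` and
# `is_correct` are hypothetical placeholders for an LLM call and an
# answer checker; the feedback prompt wording is an assumption.
from typing import Callable, Dict, List

def rise_inference(
    problem: str,
    generate: Callable[[List[Dict[str, str]]], str],
    is_correct: Callable[[str], bool],
    max_turns: int = 5,
) -> str:
    """Query the model repeatedly, feeding back its own previous attempt."""
    messages = [{"role": "user", "content": problem}]
    answer = ""
    for _ in range(max_turns):
        answer = generate(messages)          # one attempt at the problem
        if is_correct(answer):               # external verifier / oracle check
            break
        # Append the failed attempt plus a retry instruction so the next
        # turn can introspect on the earlier mistake.
        messages.append({"role": "assistant", "content": answer})
        messages.append({
            "role": "user",
            "content": "Your previous answer may be incorrect. "
                       "Please reconsider and try again.",
        })
    return answer
```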
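
The GSM8K split sizes quoted in the Dataset Splits row can be checked against the standard `gsm8k` dataset on the Hugging Face Hub. This assumes the `datasets` library; no loader is shown for MATH because the exact 7,500/1,000 subset used is not identified in this report.

```python
# Sketch of loading the GSM8K splits referenced above, assuming the
# standard "gsm8k" dataset on the Hugging Face Hub.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")
print(len(gsm8k["train"]), len(gsm8k["test"]))  # expected: 7473 and 1319
```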
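
The Software Dependencies row notes that the FastChat loss is customized to be weighted by reward. The snippet below is one plausible reading of that description, written against plain PyTorch rather than the FastChat training loop; the tensor shapes and the per-trajectory `rewards` input are assumptions.

```python
# Sketch of a reward-weighted next-token loss of the kind described above.
# Not the authors' implementation; shapes and inputs are assumptions.
import torch
import torch.nn.functional as F

def reward_weighted_loss(logits, labels, rewards, ignore_index=-100):
    """Cross-entropy over shifted tokens, scaled per sequence by its reward.

    logits:  (batch, seq_len, vocab) model outputs
    labels:  (batch, seq_len) target token ids, ignore_index where masked
    rewards: (batch,) scalar reward for each trajectory
    """
    # Standard causal-LM shift: predict token t+1 from position t.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    per_token = F.cross_entropy(
        shift_logits.transpose(1, 2),   # (batch, vocab, seq_len-1)
        shift_labels,
        ignore_index=ignore_index,
        reduction="none",
    )                                    # (batch, seq_len-1)
    mask = (shift_labels != ignore_index).float()
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    # Each sequence's average token loss is scaled by its reward.
    return (rewards * per_seq).mean()
```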
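
For reference, the Table 9 values map onto Hugging Face `TrainingArguments` roughly as follows. This is an illustrative mapping, not the authors' FastChat launch configuration; `output_dir` is a placeholder, and the model max length of 2048 is a tokenizer-side setting handled separately.

```python
# Illustrative mapping of the Table 9 hyperparameters onto Hugging Face
# TrainingArguments; this is not the authors' FastChat launch script.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="rise-finetune",           # placeholder path
    bf16=True,
    tf32=True,
    num_train_epochs=2,
    per_device_train_batch_size=1,        # run on 4x A40 GPUs
    gradient_accumulation_steps=16,       # effective batch size 4 * 1 * 16 = 64
    learning_rate=1e-5,
    weight_decay=0.0,
    warmup_ratio=0.04,
    lr_scheduler_type="cosine",
)
# Model max length 2048 is applied on the tokenizer, e.g.
# tokenizer.model_max_length = 2048
```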