Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

Authors: Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh R N, Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, Phil L Mui, Huan Wang, Caiming Xiong, Silvio Savarese

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on various tasks demonstrate that the language agents improve over time and that our approach considerably outperforms baselines that do not properly leverage gradients from the environment.
Researcher Affiliation | Industry | Salesforce AI Research
Pseudocode | Yes | The offline PPO algorithm we used for finetuning the Retrospective component in this paper is presented below in Algorithm 1. (A hedged PPO sketch appears after this table.)
Open Source Code | Yes | Code: https://github.com/SalesforceAIResearch/Retroformer
Open Datasets | Yes | We use open-source environments: HotPotQA (Yang et al., 2018), WebShop (Yao et al., 2022) and ALFWorld (Shridhar et al., 2021).
Dataset Splits | Yes | The agent is evaluated on 100 validation tasks from the distractor dev split of the open-source HotPotQA dataset, 134 tasks in ALFWorld and 100 tasks in WebShop, as in (Shinn et al., 2023).
Hardware Specification | Yes | All experiments are done in Google Cloud Platform (GCP) GKE environment with A100 40GB GPUs.
Software Dependencies | No | The paper mentions software such as 'OpenAI connectors from langchain', 'FastChat', and 'trl' but does not specify their version numbers.
Experiment Setup | Yes | We fine-tune the retrospective model M_r with 4-bit quantized LoRA adapters (r=1 or r=4) on the offline RL datasets with epochs=4, batch size=8, lr=1.4e-5. The number of trainable parameters is 0.53M (0.015% of llama-7b) or 2.25M. Since longchat-16k is based on LLaMA, we used the default llama recipes for finetuning. Specifically, we first run the supervised fine-tuning trainer on the samples with positive ratings for 2 epochs and then the RLHF pipeline, including reward modeling and RL fine-tuning with PPO, on the whole offline rating dataset using the default settings for the llama-7b model. Key hyperparameters: Supervised Finetuning: learning rate=1e-5, batch size=32, max steps=5,000. Reward Modeling: learning rate=2.5e-5, batch size=32, max steps=20,000. Policy Gradient Finetuning: learning rate=1.4e-5, max steps=20,000, output max length=128, batch size=64, gradient accumulation steps=8, ppo epochs=4. (Hedged sketches of the SFT and PPO stages follow the table.)
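
To make the quoted setup easier to follow, here is a minimal sketch of the 4-bit quantized LoRA supervised fine-tuning stage on positively rated retrospective samples, assuming a Hugging Face transformers/peft/trl stack with the older SFTTrainer interface. The backbone name, data file, target_modules, lora_alpha, and lora_dropout values are illustrative assumptions, not taken from the paper or its repository; the learning rate, batch size, and max steps follow the numbers quoted above.

```python
# Sketch of the SFT stage on positively rated reflections (assumed names/values noted inline).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

base_model = "lmsys/longchat-7b-16k"  # assumed longchat-16k backbone

bnb_config = BitsAndBytesConfig(      # 4-bit quantization for the frozen base weights
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

lora_config = LoraConfig(             # r=4 variant; r=1 gives the smaller 0.53M adapter
    r=4,
    lora_alpha=16,                    # assumed value
    lora_dropout=0.05,                # assumed value
    target_modules=["q_proj", "v_proj"],  # assumed projection targets
    task_type="CAUSAL_LM",
)

# Positively rated retrospective samples only (hypothetical file name and schema).
sft_dataset = load_dataset("json", data_files="positive_reflections.json", split="train")

training_args = TrainingArguments(
    output_dir="retro_sft",
    learning_rate=1e-5,               # "Supervised Finetuning: learning rate=1e-5"
    per_device_train_batch_size=32,   # "batch size=32"
    max_steps=5_000,                  # "max steps=5,000"
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=sft_dataset,
    peft_config=lora_config,
    dataset_text_field="text",        # assumes each record stores prompt + reflection under "text"
    max_seq_length=2048,
)
trainer.train()
```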
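
A similarly hedged sketch of the policy-gradient (PPO) stage, using the older trl PPOTrainer interface. The backbone name, prompt text, and the placeholder reward function (standing in for the trained reward model and the offline rating signal) are assumptions; the learning rate, batch size, gradient accumulation steps, PPO epochs, and 128-token output cap mirror the quoted hyperparameters.

```python
# Sketch of the PPO fine-tuning stage (assumed names/values noted inline).
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

base_model = "lmsys/longchat-7b-16k"  # assumed longchat-16k backbone
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(base_model)

ppo_config = PPOConfig(
    learning_rate=1.4e-5,             # "learning rate=1.4e-5"
    batch_size=64,                    # "batch size=64"
    gradient_accumulation_steps=8,    # "gradient accumulation steps=8"
    ppo_epochs=4,                     # "ppo epochs=4"
    mini_batch_size=1,
)
ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer)

def rating_reward(response_text: str) -> torch.Tensor:
    """Placeholder reward: in the paper this comes from the reward model and the
    offline rating dataset (change in episode return); a constant here."""
    return torch.tensor(0.0)

# A dummy batch of identical prompts, sized to match ppo_config.batch_size.
prompt = "Reflection on the failed trial: "   # hypothetical prompt prefix
query_tensors = [
    tokenizer(prompt, return_tensors="pt").input_ids[0]
    for _ in range(ppo_config.batch_size)
]

response_tensors = ppo_trainer.generate(
    query_tensors,
    return_prompt=False,
    max_new_tokens=128,               # "output max length=128"
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
responses = tokenizer.batch_decode(response_tensors)
rewards = [rating_reward(r) for r in responses]
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```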