Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

Authors: Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh R N, Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, Phil L Mui, Huan Wang, Caiming Xiong, Silvio Savarese

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on various tasks demonstrate that the language agents improve over time and that our approach considerably outperforms baselines that do not properly leverage gradients from the environment.
Researcher Affiliation | Industry | Salesforce AI Research
Pseudocode | Yes | The offline PPO algorithm we used for finetuning the Retrospective component in this paper is presented below in Algorithm 1. (A hedged PPO sketch appears after this table.)
Open Source Code | Yes | Code: https://github.com/SalesforceAIResearch/Retroformer
Open Datasets | Yes | We use open-source environments: HotPotQA (Yang et al., 2018), WebShop (Yao et al., 2022) and ALFWorld (Shridhar et al., 2021).
Dataset Splits | Yes | The agent is evaluated on 100 validation tasks from the distractor dev split of the open-source HotPotQA dataset, 134 tasks in ALFWorld and 100 tasks in WebShop, as in (Shinn et al., 2023).
Hardware Specification | Yes | All experiments are done in Google Cloud Platform (GCP) GKE environment with A100 40GB GPUs.
Software Dependencies | No | The paper mentions software such as 'OpenAI connectors from langchain', 'FastChat', and 'trl' but does not specify their version numbers.
Experiment Setup | Yes | We fine-tune the retrospective model M_r with 4-bit quantized LoRA adapters (r=1 or r=4) on the offline RL datasets with epochs=4, batch size=8, lr=1.4e-5. The number of trainable parameters is 0.53M (0.015% of llama-7b) or 2.25M. Since longchat-16k is based on LLaMA, we used the default llama recipes for finetuning. Specifically, we first run the supervised fine-tuning trainer on the samples with positive ratings for 2 epochs and then the RLHF pipeline, including reward modeling and RL fine-tuning with PPO, on the whole offline rating dataset using the default settings for the llama-7b model. Key hyperparameters: Supervised Finetuning: learning rate=1e-5, batch size=32, max steps=5,000. Reward Modeling: learning rate=2.5e-5, batch size=32, max steps=20,000. Policy Gradient Finetuning: learning rate=1.4e-5, max steps=20,000, output max length=128, batch size=64, gradient accumulation steps=8, ppo epochs=4. (Hedged sketches of the SFT and PPO stages follow the table.)
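
To make the quoted setup easier to follow, here is a minimal sketch of the 4-bit quantized LoRA supervised fine-tuning stage on positively rated retrospective samples, assuming a Hugging Face transformers/peft/trl stack with the older SFTTrainer interface. The backbone name, data file, target_modules, lora_alpha, and lora_dropout values are illustrative assumptions, not taken from the paper or its repository; the learning rate, batch size, and max steps follow the numbers quoted above.

```python
# Sketch of the SFT stage on positively rated reflections (assumed names/values noted inline).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

base_model = "lmsys/longchat-7b-16k"  # assumed longchat-16k backbone

bnb_config = BitsAndBytesConfig(      # 4-bit quantization for the frozen base weights
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

lora_config = LoraConfig(             # r=4 variant; r=1 gives the smaller 0.53M adapter
    r=4,
    lora_alpha=16,                    # assumed value
    lora_dropout=0.05,                # assumed value
    target_modules=["q_proj", "v_proj"],  # assumed projection targets
    task_type="CAUSAL_LM",
)

# Positively rated retrospective samples only (hypothetical file name and schema).
sft_dataset = load_dataset("json", data_files="positive_reflections.json", split="train")

training_args = TrainingArguments(
    output_dir="retro_sft",
    learning_rate=1e-5,               # "Supervised Finetuning: learning rate=1e-5"
    per_device_train_batch_size=32,   # "batch size=32"
    max_steps=5_000,                  # "max steps=5,000"
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=sft_dataset,
    peft_config=lora_config,
    dataset_text_field="text",        # assumes each record stores prompt + reflection under "text"
    max_seq_length=2048,
)
trainer.train()
```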
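
A similarly hedged sketch of the policy-gradient (PPO) stage, using the older trl PPOTrainer interface. The backbone name, prompt text, and the placeholder reward function (standing in for the trained reward model and the offline rating signal) are assumptions; the learning rate, batch size, gradient accumulation steps, PPO epochs, and 128-token output cap mirror the quoted hyperparameters.

```python
# Sketch of the PPO fine-tuning stage (assumed names/values noted inline).
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

base_model = "lmsys/longchat-7b-16k"  # assumed longchat-16k backbone
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(base_model)

ppo_config = PPOConfig(
    learning_rate=1.4e-5,             # "learning rate=1.4e-5"
    batch_size=64,                    # "batch size=64"
    gradient_accumulation_steps=8,    # "gradient accumulation steps=8"
    ppo_epochs=4,                     # "ppo epochs=4"
    mini_batch_size=1,
)
ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer)

def rating_reward(response_text: str) -> torch.Tensor:
    """Placeholder reward: in the paper this comes from the reward model and the
    offline rating dataset (change in episode return); a constant here."""
    return torch.tensor(0.0)

# A dummy batch of identical prompts, sized to match ppo_config.batch_size.
prompt = "Reflection on the failed trial: "   # hypothetical prompt prefix
query_tensors = [
    tokenizer(prompt, return_tensors="pt").input_ids[0]
    for _ in range(ppo_config.batch_size)
]

response_tensors = ppo_trainer.generate(
    query_tensors,
    return_prompt=False,
    max_new_tokens=128,               # "output max length=128"
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
responses = tokenizer.batch_decode(response_tensors)
rewards = [rating_reward(r) for r in responses]
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```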