Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization
Authors: Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh R N, Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, Phil L Mui, Huan Wang, Caiming Xiong, Silvio Savarese
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on various tasks demonstrate that the language agents improve over time and that our approach considerably outperforms baselines that do not properly leverage gradients from the environment. |
| Researcher Affiliation | Industry | Salesforce AI Research |
| Pseudocode | Yes | The offline PPO algorithm we used for finetuning the Retrospective component in this paper is presented below in Algorithm 1. |
| Open Source Code | Yes | Code: https://github.com/SalesforceAIResearch/Retroformer |
| Open Datasets | Yes | We use open-source environments: HotPotQA (Yang et al., 2018), WebShop (Yao et al., 2022) and ALFWorld (Shridhar et al., 2021) |
| Dataset Splits | Yes | The agent is evaluated on 100 validation tasks from the distractor dev split of the open-source HotPotQA dataset, 134 tasks in ALFWorld and 100 tasks in WebShop, as in (Shinn et al., 2023). |
| Hardware Specification | Yes | All experiments are done in Google Cloud Platform (GCP) GKE environment with A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions software like 'OpenAI connectors from langchain', 'FastChat', and 'trl' but does not specify their version numbers. |
| Experiment Setup | Yes | We fine-tune the retrospective model Mr with 4-bit quantized LoRA adapters (r=1 or r=4) on the offline RL datasets with epochs=4; batch size=8; lr=1.4e-5. The number of trainable parameters is 0.53M (0.015% of llama-7b) or 2.25M. Since longchat-16k is based on Llama, we used the default llama recipes for finetuning. Specifically, we first run supervised fine-tuning trainer on the samples with positive ratings for 2 epochs and then the RLHF pipeline, including reward modeling, and RL fine-tuning with PPO, on the whole offline rating dataset using the default settings for llama-7b model. We list the key hyperparameters here: Supervised Finetuning: learning rate=1e-5, batch size=32, max steps=5,000. Reward Modeling: learning rate=2.5e-5, batch size=32, max steps=20,000. Policy Gradient Finetuning: learning rate=1.4e-5, max steps=20,000, output max length=128, batch size=64, gradient accumulation steps=8, ppo epochs=4. (See the illustrative sketches below the table.) |
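
The Experiment Setup row reports the fine-tuning configuration but not the code that realizes it. Below is a minimal, illustrative sketch (not the authors' implementation) of loading a Llama-family base model in 4-bit and attaching a LoRA adapter at the reported rank; the checkpoint name, LoRA alpha/dropout, and target modules are assumptions not stated in the paper.

```python
# Hedged sketch: 4-bit quantized base model + LoRA adapter, mirroring the
# reported hyperparameters (r=1 or r=4, lr=1.4e-5, batch size 8, 4 epochs).
# The checkpoint name, lora_alpha, lora_dropout, and target_modules are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model_name = "lmsys/longchat-7b-16k"  # assumed; the paper only says "longchat-16k" (Llama-based)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

model = prepare_model_for_kbit_training(model)  # standard step before training a k-bit model

lora_config = LoraConfig(
    r=4,                                   # paper reports r=1 or r=4
    lora_alpha=16,                         # assumed; not reported
    lora_dropout=0.05,                     # assumed; not reported
    target_modules=["q_proj", "v_proj"],   # common Llama choice; assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # ~2.1M trainable params with these choices,
                                           # the same order as the 2.25M quoted above
```

Under the paper's description, this adapter would first be trained with supervised fine-tuning on the positively rated samples (lr=1e-5, batch size 32, max steps 5,000) before the reward-modeling and PPO stages.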
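
For the policy-gradient stage itself (the offline PPO procedure referenced in the Pseudocode row), the sketch below uses trl's PPOTrainer with the reported settings (lr=1.4e-5, batch size 64, gradient accumulation 8, 4 PPO epochs, output max length 128). The prompt text, mini-batch size, and constant placeholder reward are assumptions; in the paper the reward comes from a reward model trained on the offline rating dataset, and the trl API shown follows the pre-1.0 quick-start, which has changed in newer releases.

```python
# Hedged sketch of one PPO update with trl; not the authors' Algorithm 1 as published.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "lmsys/longchat-7b-16k"            # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token       # Llama tokenizers ship without a pad token

ppo_config = PPOConfig(
    model_name=model_name,
    learning_rate=1.4e-5,                       # reported
    batch_size=64,                              # reported
    mini_batch_size=8,                          # assumed (batch_size / gradient_accumulation_steps)
    gradient_accumulation_steps=8,              # reported
    ppo_epochs=4,                               # reported
)

policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ppo_trainer = PPOTrainer(ppo_config, policy, ref_model, tokenizer)

# One illustrative update on a toy batch; a real run iterates over the offline
# dataset of (reflection prompt, reflection, rating) records.
prompt = "Reflect on the failed trial and propose an improved plan: ..."  # hypothetical
queries = [tokenizer(prompt, return_tensors="pt").input_ids[0]] * ppo_config.batch_size
responses = ppo_trainer.generate(
    queries, return_prompt=False, max_new_tokens=128, do_sample=True
)
rewards = [torch.tensor(1.0) for _ in queries]  # placeholder scores from a reward model
stats = ppo_trainer.step(queries, responses, rewards)
```

LoRA on the policy and the reward-model scoring loop are omitted for brevity; the latter would slot in where the placeholder reward is assigned.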