AGILE: A Novel Reinforcement Learning Framework of LLM Agents

Authors: Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, Hang Li

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive experiments on ProductQA, MedMCQA and HotPotQA show that AGILE agents based on 7B and 13B LLMs trained with PPO can outperform GPT-4 agents. Our ablation study highlights the indispensability of memory, tools, consultation, reflection, and reinforcement learning in achieving the agent's strong performance."
Researcher Affiliation | Collaboration | Peiyuan Feng (1), Yichen He (1), Guanhua Huang (2), Yuan Lin (1), Hanchong Zhang (3), Yuchen Zhang (1), Hang Li (1); 1 ByteDance Research, 2 University of Science and Technology of China, 3 Shanghai Jiao Tong University; {fpy,hyc,linyuan.0,zhangyuchen.zyc,lihang.lh}@bytedance.com, guanhuahuang@mail.ustc.edu.cn, zhanghanchong@sjtu.edu.cn
Pseudocode | Yes | "Finally, we present the session-level optimization algorithm as Algorithm 1. In this algorithm, the state advantage function is the only component that concerns inter-session correlation. While the algorithm is iterative, we anticipate that in practice, the outer loop will require only a few iterations to converge." (An illustrative skeleton of this loop structure is sketched after this listing.)
Open Source Code | Yes | "Datasets and code are available at https://github.com/bytarnish/AGILE."
Open Datasets | Yes | "We focus on question answering and release a dataset for agents called ProductQA, comprising challenging questions in online shopping. Our extensive experiments on ProductQA, MedMCQA and HotPotQA show that AGILE agents based on 7B and 13B LLMs trained with PPO can outperform GPT-4 agents. Our ablation study highlights the indispensability of memory, tools, consultation, reflection, and reinforcement learning in achieving the agent's strong performance. Datasets and code are available at https://github.com/bytarnish/AGILE."
Dataset Splits | Yes | "We evaluate our agent framework on three tasks, ProductQA, MedMCQA and HotPotQA. For ProductQA, we use a two-stage training method based on Vicuna-13b [6]. In the first stage, imitation learning is employed to create agile-vic13b-sft. In the second stage, the policy gradient algorithm of PPO [37] produces agile-vic13b-ppo."
Hardware Specification | Yes | "The training runs on NVIDIA-H800."
Software Dependencies | No | The paper mentions models such as 'Vicuna-13b' and 'Meerkat-7b' and the 'PPO algorithm', but does not provide specific version numbers for general software dependencies or libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | "We fine-tune the model for 2 epochs with a learning rate of 1e-5 and a batch size of 64. We implement PPO for 1 epoch with a learning rate of 1e-6 and a batch size of 64." (These values, together with the two-stage recipe quoted under Dataset Splits, are summarized in the configuration sketch after this listing.)
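As a reading aid for the Pseudocode row above, the following skeleton illustrates the loop structure described in the quoted excerpt: an outer loop that re-estimates the state advantage function (the only component coupling different sessions) and an inner PPO update, with the outer loop expected to converge in a few iterations. This is not the paper's Algorithm 1; every helper name below (collect_sessions, estimate_state_advantages, ppo_update) is a hypothetical placeholder for a component the paper defines.

```python
"""Illustrative-only skeleton of the session-level optimization loop described
in the excerpt above. NOT the paper's Algorithm 1; all helpers are placeholders."""

from typing import Any, Dict, List


def collect_sessions(policy: Any, dataset: List[dict]) -> List[dict]:
    # Placeholder: roll the agent through whole QA sessions (memory, tool use,
    # seeking advice, reflection) and record the resulting trajectories.
    raise NotImplementedError


def estimate_state_advantages(sessions: List[dict]) -> Dict[int, float]:
    # Placeholder: the state advantage function is the only quantity that
    # concerns inter-session correlation, so it is re-estimated once per outer loop.
    raise NotImplementedError


def ppo_update(policy: Any, sessions: List[dict], advantages: Dict[int, float]) -> None:
    # Placeholder: a standard clipped-PPO policy/value update within sessions.
    raise NotImplementedError


def train_session_level(policy: Any, dataset: List[dict], outer_iters: int = 3) -> Any:
    # The outer loop is anticipated to need only a few iterations to converge.
    for _ in range(outer_iters):
        sessions = collect_sessions(policy, dataset)
        advantages = estimate_state_advantages(sessions)
        ppo_update(policy, sessions, advantages)
    return policy
```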
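Similarly, the two-stage recipe quoted under Dataset Splits and the hyperparameters quoted under Experiment Setup can be collected into a plain configuration for quick reference. Only the numeric values (epochs, learning rates, batch sizes) and the model names come from the excerpts above; the dataclass and its field names are illustrative, not taken from the released code.

```python
# Hedged summary of the reported two-stage training recipe as a plain-Python config.
# The StageConfig dataclass and field names are illustrative; the values come from
# the excerpts quoted in the listing above.
from dataclasses import dataclass


@dataclass
class StageConfig:
    base_model: str
    method: str
    epochs: int
    learning_rate: float
    batch_size: int
    output_name: str


# Stage 1: imitation learning (SFT) on Vicuna-13b produces agile-vic13b-sft.
SFT_STAGE = StageConfig(
    base_model="Vicuna-13b",
    method="imitation learning (SFT)",
    epochs=2,
    learning_rate=1e-5,
    batch_size=64,
    output_name="agile-vic13b-sft",
)

# Stage 2: PPO fine-tuning of the SFT model produces agile-vic13b-ppo.
PPO_STAGE = StageConfig(
    base_model="agile-vic13b-sft",
    method="PPO (policy gradient)",
    epochs=1,
    learning_rate=1e-6,
    batch_size=64,
    output_name="agile-vic13b-ppo",
)
```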