AGILE: A Novel Reinforcement Learning Framework of LLM Agents

Authors: Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, Hang Li

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive experiments on ProductQA, MedMCQA and HotPotQA show that AGILE agents based on 7B and 13B LLMs trained with PPO can outperform GPT-4 agents. Our ablation study highlights the indispensability of memory, tools, consultation, reflection, and reinforcement learning in achieving the agent's strong performance."
Researcher Affiliation | Collaboration | Peiyuan Feng (1), Yichen He (1), Guanhua Huang (2), Yuan Lin (1), Hanchong Zhang (3), Yuchen Zhang (1), Hang Li (1); 1 ByteDance Research, 2 University of Science and Technology of China, 3 Shanghai Jiao Tong University; {fpy,hyc,linyuan.0,zhangyuchen.zyc,lihang.lh}@bytedance.com, guanhuahuang@mail.ustc.edu.cn, zhanghanchong@sjtu.edu.cn
Pseudocode | Yes | "Finally, we present the session-level optimization algorithm as Algorithm 1. In this algorithm, the state advantage function is the only component that concerns inter-session correlation. While the algorithm is iterative, we anticipate that in practice, the outer loop will require only a few iterations to converge." (An illustrative skeleton of this loop structure is sketched after this listing.)
Open Source Code | Yes | "Datasets and code are available at https://github.com/bytarnish/AGILE."
Open Datasets | Yes | "We focus on question answering and release a dataset for agents called ProductQA, comprising challenging questions in online shopping. Our extensive experiments on ProductQA, MedMCQA and HotPotQA show that AGILE agents based on 7B and 13B LLMs trained with PPO can outperform GPT-4 agents. Our ablation study highlights the indispensability of memory, tools, consultation, reflection, and reinforcement learning in achieving the agent's strong performance. Datasets and code are available at https://github.com/bytarnish/AGILE."
Dataset Splits | Yes | "We evaluate our agent framework on three tasks, ProductQA, MedMCQA and HotPotQA. For ProductQA, we use a two-stage training method based on Vicuna-13b [6]. In the first stage, imitation learning is employed to create agile-vic13b-sft. In the second stage, the policy gradient algorithm of PPO [37] produces agile-vic13b-ppo."
Hardware Specification | Yes | "The training runs on NVIDIA-H800."
Software Dependencies | No | The paper mentions models such as 'Vicuna-13b' and 'Meerkat-7b' and the 'PPO algorithm', but does not provide specific version numbers for general software dependencies or libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | "We fine-tune the model for 2 epochs with a learning rate of 1e-5 and a batch size of 64. We implement PPO for 1 epoch with a learning rate of 1e-6 and a batch size of 64." (These values, together with the two-stage recipe quoted under Dataset Splits, are summarized in the configuration sketch after this listing.)
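As a reading aid for the Pseudocode row above, the following skeleton illustrates the loop structure described in the quoted excerpt: an outer loop that re-estimates the state advantage function (the only component coupling different sessions) and an inner PPO update, with the outer loop expected to converge in a few iterations. This is not the paper's Algorithm 1; every helper name below (collect_sessions, estimate_state_advantages, ppo_update) is a hypothetical placeholder for a component the paper defines.

```python
"""Illustrative-only skeleton of the session-level optimization loop described
in the excerpt above. NOT the paper's Algorithm 1; all helpers are placeholders."""

from typing import Any, Dict, List


def collect_sessions(policy: Any, dataset: List[dict]) -> List[dict]:
    # Placeholder: roll the agent through whole QA sessions (memory, tool use,
    # seeking advice, reflection) and record the resulting trajectories.
    raise NotImplementedError


def estimate_state_advantages(sessions: List[dict]) -> Dict[int, float]:
    # Placeholder: the state advantage function is the only quantity that
    # concerns inter-session correlation, so it is re-estimated once per outer loop.
    raise NotImplementedError


def ppo_update(policy: Any, sessions: List[dict], advantages: Dict[int, float]) -> None:
    # Placeholder: a standard clipped-PPO policy/value update within sessions.
    raise NotImplementedError


def train_session_level(policy: Any, dataset: List[dict], outer_iters: int = 3) -> Any:
    # The outer loop is anticipated to need only a few iterations to converge.
    for _ in range(outer_iters):
        sessions = collect_sessions(policy, dataset)
        advantages = estimate_state_advantages(sessions)
        ppo_update(policy, sessions, advantages)
    return policy
```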
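Similarly, the two-stage recipe quoted under Dataset Splits and the hyperparameters quoted under Experiment Setup can be collected into a plain configuration for quick reference. Only the numeric values (epochs, learning rates, batch sizes) and the model names come from the excerpts above; the dataclass and its field names are illustrative, not taken from the released code.

```python
# Hedged summary of the reported two-stage training recipe as a plain-Python config.
# The StageConfig dataclass and field names are illustrative; the values come from
# the excerpts quoted in the listing above.
from dataclasses import dataclass


@dataclass
class StageConfig:
    base_model: str
    method: str
    epochs: int
    learning_rate: float
    batch_size: int
    output_name: str


# Stage 1: imitation learning (SFT) on Vicuna-13b produces agile-vic13b-sft.
SFT_STAGE = StageConfig(
    base_model="Vicuna-13b",
    method="imitation learning (SFT)",
    epochs=2,
    learning_rate=1e-5,
    batch_size=64,
    output_name="agile-vic13b-sft",
)

# Stage 2: PPO fine-tuning of the SFT model produces agile-vic13b-ppo.
PPO_STAGE = StageConfig(
    base_model="agile-vic13b-sft",
    method="PPO (policy gradient)",
    epochs=1,
    learning_rate=1e-6,
    batch_size=64,
    output_name="agile-vic13b-ppo",
)
```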