Offline Training of Language Model Agents with Functions as Learnable Weights

Authors: Shaokun Zhang, Jieyu Zhang, Jiale Liu, Linxin Song, Chi Wang, Ranjay Krishna, Qingyun Wu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted extensive empirical evaluations on three distinct tasks: mathematical reasoning (MATH) (Hendrycks et al., 2021), tabular processing (TabMWP) (Lu et al., 2023), and general real-world problems (GAIA) (Mialon et al., 2023). We trained two typical agent systems, the GPT-4+ agent (OpenAI, 2023) and the ReAct agent (Yao et al., 2023), using the agent training method. For the MATH dataset, agent training resulted in a clear performance improvement in almost all cases.
Researcher Affiliation | Collaboration | Shaokun Zhang*1, Jieyu Zhang*2, Jiale Liu1, Linxin Song3, Chi Wang4, Ranjay Krishna2, Qingyun Wu1. *Equal contribution. 1Pennsylvania State University, 2University of Washington, 3University of Southern California, 4Microsoft Research. Correspondence to: Qingyun Wu <qingyun.wu@psu.edu>.
Pseudocode | Yes | Algorithm 1: Progressive Function Update (AgentOptimizer.step) ... Algorithm 2: Agent Training. (A Python sketch of this training loop appears after the table.)
Open Source Code | Yes | We have integrated our method into the AutoGen library. (A hedged usage sketch follows the table.)
Open Datasets | Yes | We conducted extensive empirical evaluations on three distinct tasks: Mathematical Reasoning, Tabular Processing, and General Real-World Tasks. ... (1) Mathematical reasoning: Following a setting similar to (Yuan et al., 2024), we use a subset of the MATH dataset (Hendrycks et al., 2021) to evaluate the LLM agent's performance in addressing mathematical problems. ... (2) Tabular processing: The TabMWP (Lu et al., 2023) dataset evaluates agents in processing structured data in tables... (3) General real-world tasks: The GAIA dataset (Mialon et al., 2023) is dedicated to evaluating LLM agents in solving unambiguous real-world questions.
Dataset Splits | No | The paper discusses evaluating performance on the 'training set' for roll-back and early-stopping strategies, but it does not explicitly define or use a separate 'validation set' or provide specific split percentages/counts for a validation split.
Hardware Specification | No | The paper mentions specific LLM models such as 'GPT-4-1106-preview' and 'GPT-3.5-turbo-1106', which are accessed via API (black-box models), but it does not specify the underlying hardware (GPU, CPU, memory, etc.) on which these models or the experiments themselves were run.
Software Dependencies | No | The paper mentions 'Python' and the 'Lizard Python library' but does not provide specific version numbers for these or any other software dependencies needed to replicate the experiments. (A small lizard usage example follows the table.)
Experiment Setup | Yes | The proposed agent training method involves several hyperparameters, including the training epoch count, early-stop threshold, and maximum number of actions. In our empirical experiments across all three datasets, we consistently used the same hyperparameter configuration for the proposed agent training algorithm. Specifically: (1) we set the number of training epochs to 10 for all experiments; (2) an early-stopping criterion was established with a threshold of 10 epochs, so that training terminated after 10 consecutive epochs without any improvement in training performance; (3) we restricted the maximum number of actions taken during each function-update step to 3. (These values are mirrored in the configuration sketch after the table.)
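
The Pseudocode row above refers to Algorithm 1 (Progressive Function Update, AgentOptimizer.step) and Algorithm 2 (Agent Training). Below is a minimal sketch of that training loop, not the authors' implementation: `evaluate` and `propose` are hypothetical callables supplied by the caller, standing in for training-set evaluation and for the LLM-driven add/revise/remove function update of Algorithm 1.

```python
# Minimal sketch of the offline agent-training loop (Algorithm 2).
# `evaluate(functions)` -> (training-set score, conversation histories)   [hypothetical]
# `propose(functions, histories, max_actions)` -> updated function set    [hypothetical]
from copy import deepcopy

def train_agent(functions, evaluate, propose, epochs=10, early_stop=10, max_actions=3):
    best, best_score, stale = deepcopy(functions), float("-inf"), 0
    for _ in range(epochs):
        score, histories = evaluate(functions)        # performance on the training set
        if score > best_score:                        # keep the improved function set
            best, best_score, stale = deepcopy(functions), score, 0
        else:                                         # roll back the unhelpful update
            functions, stale = deepcopy(best), stale + 1
            if stale >= early_stop:                   # early stopping
                break
        # Progressive function update (Algorithm 1): propose at most `max_actions`
        # add / revise / remove edits to the agent's callable functions.
        functions = propose(functions, histories, max_actions)
    return best
```

With the reported defaults (epochs=10, early_stop=10, max_actions=3), `train_agent(current_functions, evaluate, propose)` returns the best-performing function set found during training.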
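The Open Source Code row notes that the method was integrated into the AutoGen library. The sketch below shows how such an optimizer is typically driven; the import path and method names (AgentOptimizer, record_one_conversation, step, update_function_signature, register_function) follow AutoGen's AgentOptimizer example at the time of writing and should be treated as assumptions that may differ across library versions. `train_set` and `is_correct` are illustrative stand-ins.

```python
# Hedged sketch of driving the AgentOptimizer shipped with AutoGen; API names are
# assumptions based on AutoGen's AgentOptimizer example and may vary by version.
from autogen import AssistantAgent, UserProxyAgent
from autogen.agentchat.contrib.agent_optimizer import AgentOptimizer

llm_config = {"config_list": [{"model": "gpt-4-1106-preview", "api_key": "..."}]}
assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent("user_proxy", human_input_mode="NEVER",
                            code_execution_config=False)
optimizer = AgentOptimizer(max_actions_per_step=3, llm_config=llm_config)

train_set = [("What is 2 + 2?", "4")]                # hypothetical toy training set
def is_correct(history, label):                      # hypothetical success check
    return label in str(history[-1].get("content", ""))

for problem, label in train_set:
    user_proxy.initiate_chat(assistant, message=problem)
    history = assistant.chat_messages[user_proxy]    # recorded conversation
    optimizer.record_one_conversation(history, is_satisfied=is_correct(history, label))

# One function-update step: signature edits for the LLM-facing agent and an
# updated function map for the executing agent.
register_for_llm, register_for_executor = optimizer.step()
for action in register_for_llm:
    assistant.update_function_signature(**action)
if register_for_executor:
    user_proxy.register_function(function_map=register_for_executor)
```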
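The Software Dependencies row mentions the Lizard Python library, which is commonly used to compute code-complexity statistics. The snippet below shows that kind of measurement on an illustrative learned function; the function body is made up, not taken from the paper.

```python
# Measure size and cyclomatic complexity of a learned function with lizard.
import lizard

learned_function = '''
def add_fractions(a, b, c, d):
    """Return a/b + c/d as a simplified fraction string."""
    from fractions import Fraction
    return str(Fraction(a, b) + Fraction(c, d))
'''

info = lizard.analyze_file.analyze_source_code("learned_function.py", learned_function)
for func in info.function_list:
    print(func.name, func.nloc, func.cyclomatic_complexity)
```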
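The hyperparameters quoted in the Experiment Setup row map onto a small configuration object; the field names below are illustrative and match the defaults used in the training-loop sketch above.

```python
# Hyperparameters reported for all three datasets; field names are illustrative.
from dataclasses import dataclass

@dataclass
class AgentTrainingConfig:
    epochs: int = 10        # training epochs used in all experiments
    early_stop: int = 10    # consecutive epochs without improvement before stopping
    max_actions: int = 3    # max add/revise/remove actions per function-update step
```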