Executable Code Actions Elicit Better LLM Agents

Authors: Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct.
Researcher Affiliation | Collaboration | 1Department of Computer Science, University of Illinois Urbana-Champaign; 2Apple. Correspondence to: Xingyao Wang <xingyao6@illinois.edu>, Heng Ji <hengji@illinois.edu>.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code, data, model, and demo are available at https://github.com/xingyaoww/code-act.
Open Datasets | Yes | Information Seeking: We use a training subset of HotpotQA (Yang et al., 2018) to generate information-seeking trajectories... Software Package (Tool) Usage: We use the training set of code generation problems in APPS (Hendrycks et al., 2021a) and math problems in MATH (Hendrycks et al., 2021b)... External Memory: We repurpose the training subset of WikiTableQuestions (Pasupat & Liang, 2015)... Robot Planning: We use ALFWorld (Shridhar et al., 2020)...
Dataset Splits | No | The paper does not explicitly state the training, validation, and test splits (e.g., percentages or exact counts) used for its own model training. It only describes the datasets used for instruction tuning and for evaluation on external benchmarks.
Hardware Specification | Yes | All SFT experiments are performed on one 4x A100 40GB SXM node using a fork of Megatron-LLM (Cano et al., 2023) with a training throughput of around 9k tokens per second.
Software Dependencies | No | The paper mentions using a "fork of Megatron-LLM (Cano et al., 2023)" but does not specify its version, nor the versions of other key software libraries or frameworks. Python (Jupyter Notebook) is also mentioned without a version.
Experiment Setup | Yes | We perform full-parameter supervised finetuning with a sequence length of 4,096 tokens for Llama-2 and 16,384 for Mistral... We use ChatML format... We pack short instances into longer ones and apply flash attention for training efficiency. We train both LLaMA-2 and Mistral LLMs with Tensor Parallel of 4, the learning rate of 1e-5 with 50 warmup steps and cosine decay (end learning rate of 1e-6). We train for five epochs with a batch size of 32. We use the 3rd epoch checkpoint for all our experiments.
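
The Research Type row above quotes the paper's core mechanism: the agent acts by emitting executable Python code rather than text or JSON tool calls, and the interpreter's output is fed back as the next observation. The sketch below is a minimal illustration of that loop under our own assumptions, not the authors' released agent: `query_llm` is a stand-in for any chat-model call, the `<execute>` action delimiter is illustrative, and plain `exec` with stdout capture replaces the paper's Jupyter-style executor.

```python
import io
import re
import traceback
from contextlib import redirect_stdout

# Hypothetical action delimiter; the released agent's exact format may differ.
ACTION = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)

def run_action(code: str, namespace: dict) -> str:
    """Execute one code action and return the observation:
    captured stdout, or the traceback if the action raises."""
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):
            exec(code, namespace)  # `namespace` persists, so later turns can reuse variables
    except Exception:
        buffer.write(traceback.format_exc())
    return buffer.getvalue()

def run_episode(query_llm, task: str, max_turns: int = 5) -> str:
    """Multi-turn CodeAct-style loop: the model proposes code, the
    environment executes it, and the observation is appended to history."""
    history = [{"role": "user", "content": task}]
    namespace: dict = {}  # shared interpreter state, akin to a Jupyter kernel
    for _ in range(max_turns):
        reply = query_llm(history)                      # hypothetical chat-LLM call
        history.append({"role": "assistant", "content": reply})
        action = ACTION.search(reply)
        if action is None:                              # no code action: treat as final answer
            return reply
        observation = run_action(action.group(1), namespace)
        history.append({"role": "user", "content": f"Observation:\n{observation}"})
    return history[-1]["content"]
```

Because the action is code, a single turn can compose several operations with loops and conditionals and store intermediate results in variables, which is the flexibility the paper credits for the up-to-20% higher success rate over text and JSON action formats.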
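
The Experiment Setup row lists the SFT hyperparameters inline. Purely for readability, the sketch below collects the quoted values into a single configuration dictionary; the key names are ours and do not correspond to the authors' Megatron-LLM fork, and anything not quoted in the row (optimizer choice, weight decay, etc.) is left out.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered into one
# illustrative config. Key names are assumptions, not the authors' schema.
SFT_CONFIG = {
    "finetuning": "full-parameter SFT",
    "sequence_length": {"llama-2": 4_096, "mistral": 16_384},
    "chat_format": "ChatML",
    "pack_short_instances": True,      # short examples packed into longer sequences
    "flash_attention": True,
    "tensor_parallel": 4,
    "learning_rate": 1e-5,
    "warmup_steps": 50,
    "lr_schedule": "cosine",
    "end_learning_rate": 1e-6,
    "epochs": 5,
    "batch_size": 32,
    "eval_checkpoint": "epoch 3",      # checkpoint used for all reported experiments
}
```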