ExpeL: LLM Agents Are Experiential Learners

Authors: Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang

AAAI 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments. |
| Researcher Affiliation | Academia | Andrew Zhao¹, Daniel Huang², Quentin Xu², Matthieu Lin², Yong-Jin Liu², Gao Huang¹* — ¹ Department of Automation, BNRist, Tsinghua University; ² Department of Computer Science, BNRist, Tsinghua University. {zqc21,huang-jy22,xgd22,lyh21}@mails.tsinghua.edu.cn, {liuyongjin,gaohuang}@tsinghua.edu.cn |
| Pseudocode | Yes | The pseudo-code can be found in Alg. 1. Pseudo-code for this process can be found in Alg. 2. ... a pseudo-code for this step can be found in Alg. 3. |
| Open Source Code | Yes | Visit https://andrewzh112.github.io/#expel for prompts and demos, and https://github.com/LeapLabTHU/ExpeL for code. |
| Open Datasets | Yes | The experiments are designed based on four text-based benchmarks: HotpotQA (Yang et al. 2018), a knowledge-intensive dataset... ALFWorld and WebShop (Shridhar et al. 2021; Yao et al. 2022) that require the agent... and FEVER (Thorne et al. 2018), that focuses on fact verification tasks... |
| Dataset Splits | Yes | All experiments use four-fold validation, and we report the mean and standard error over the folds. |
| Hardware Specification | No | The paper does not provide hardware details such as GPU models, CPU models, or memory specifications for its experiments. The LLMs it uses (gpt-3.5-turbo-0613 and gpt-4-0613) are accessed via APIs, so the serving hardware is outside the authors' control and is not specified. |
| Software Dependencies | Yes | All agents, including ExpeL, used gpt-3.5-turbo-0613 when performing actions during evaluation. We use gpt-4-0613 for adapting the HotpotQA insights into FEVER insights. |
| Experiment Setup | Yes | All experiments use four-fold validation, and we report the mean and standard error over the folds. Following ReAct, for all environments, we use success rate as the evaluation metric: exact matching for HotpotQA and FEVER, completing the task in time for ALFWorld, and purchasing the item that matches all attributes for WebShop. All agents, including ExpeL, used gpt-3.5-turbo-0613 when performing actions during evaluation. All text generations were done with temperature 0 and greedy decoding. |
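The evaluation protocol described in the rows above (four-fold validation, exact-match success rate, mean and standard error over folds) can be sketched as follows. This is a minimal illustration, not code from the ExpeL repository: the helper names and the per-fold numbers are made up for the example.

```python
import math

def exact_match(prediction: str, answer: str) -> bool:
    """Exact-match success, as used for HotpotQA and FEVER (illustrative normalization)."""
    return prediction.strip().lower() == answer.strip().lower()

def success_rate(predictions, answers):
    """Fraction of tasks solved in one fold."""
    hits = sum(exact_match(p, a) for p, a in zip(predictions, answers))
    return hits / len(answers)

def mean_and_stderr(fold_rates):
    """Mean and standard error of the mean over the k folds (k = 4 in the paper)."""
    k = len(fold_rates)
    mean = sum(fold_rates) / k
    var = sum((r - mean) ** 2 for r in fold_rates) / (k - 1)  # sample variance
    return mean, math.sqrt(var / k)

# Hypothetical per-fold success rates, just to show the reporting format:
mean, se = mean_and_stderr([0.40, 0.36, 0.44, 0.38])
print(f"{mean:.3f} ± {se:.3f}")  # → 0.395 ± 0.017
```

The standard error here uses the sample variance (k − 1 denominator), which is the usual convention when reporting mean ± standard error over a small number of folds.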