ExpeL: LLM Agents Are Experiential Learners
Authors: Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments. |
| Researcher Affiliation | Academia | Andrew Zhao¹, Daniel Huang², Quentin Xu², Matthieu Lin², Yong-Jin Liu², Gao Huang¹* ¹ Department of Automation, BNRist, Tsinghua University ² Department of Computer Science, BNRist, Tsinghua University {zqc21,huang-jy22,xgd22,lyh21}@mails.tsinghua.edu.cn, {liuyongjin,gaohuang}@tsinghua.edu.cn |
| Pseudocode | Yes | The pseudo-code can be found in Alg. 1... Pseudo-code for this process can be found in Alg. 2... A pseudo-code for this step can be found in Alg. 3. |
| Open Source Code | Yes | Visit https://andrewzh112.github.io/#expel for prompts and demos, and https://github.com/LeapLabTHU/ExpeL for code. |
| Open Datasets | Yes | The experiments are designed based on four text-based benchmarks: HotpotQA (Yang et al. 2018), a knowledge-intensive dataset... ALFWorld and WebShop (Shridhar et al. 2021; Yao et al. 2022) that require the agent... and FEVER (Thorne et al. 2018), that focuses on fact verification tasks... |
| Dataset Splits | Yes | All experiments use four-fold validation, and we report the mean and standard error over the folds. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running its experiments. It mentions using LLMs like gpt-3.5-turbo-0613 and gpt-4-0613, which are accessed via APIs, implying the hardware is external to the authors' direct control and not specified. |
| Software Dependencies | Yes | All agents, including ExpeL, used gpt-3.5-turbo-0613 when performing actions during evaluation. We use gpt-4-0613 for adapting the HotpotQA insights into FEVER insights. |
| Experiment Setup | Yes | All experiments use four-fold validation, and we report the mean and standard error over the folds. Following ReAct, for all environments, we use success rate as the evaluation metric: exact matching for HotpotQA and FEVER, completing the task in time for ALFWorld, and purchasing the item that matches all attributes for WebShop. All agents, including ExpeL, used gpt-3.5-turbo-0613 when performing actions during evaluation. All text generations were done with temperature 0 and greedy decoding. |
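
The reporting convention quoted in the Experiment Setup row (a success rate per fold, then the mean and standard error over four folds) can be illustrated with a minimal sketch. This is not the authors' code: the function names and the per-fold outcomes are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the quoted text does not say which estimator was used.

```python
import math
from statistics import mean, stdev


def success_rate(outcomes):
    """Fraction of tasks solved in one fold; `outcomes` is a list of booleans."""
    return sum(outcomes) / len(outcomes)


def mean_and_standard_error(fold_scores):
    """Mean and standard error of the mean across folds.

    Assumption: sample standard deviation (n - 1 denominator) divided by sqrt(n).
    """
    n = len(fold_scores)
    return mean(fold_scores), stdev(fold_scores) / math.sqrt(n)


# Hypothetical per-fold outcomes (True = task solved); NOT results from the paper.
folds = [
    [True, False, True, True],
    [True, True, False, False],
    [True, True, True, False],
    [False, True, False, False],
]

scores = [success_rate(fold) for fold in folds]
avg, sem = mean_and_standard_error(scores)
print(f"success rate: {avg:.3f} +/- {sem:.3f}")
```

With only four folds the standard error is a coarse estimate, so small differences between agents should be read with that in mind.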