reproducibilityindex.ai

ExpeL: LLM Agents Are Experiential Learners

Authors: Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang

AAAI 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments.
Researcher Affiliation	Academia	Andrew Zhao1, Daniel Huang2, Quentin Xu2, Matthieu Lin2, Yong-Jin Liu2, Gao Huang1* 1 Department of Automation, BNRist, Tsinghua University 2 Department of Computer Science, BNRist, Tsinghua University {zqc21,huang-jy22,xgd22,lyh21}@mails.tsinghua.edu.cn, {liuyongjin,gaohuang}@tsinghua.edu.cn
Pseudocode	Yes	The pseudo-code can be found in Alg. 1. Pseudo-code for this process can be found in Alg. 2. A pseudo-code for this step can be found in Alg. 3.
Open Source Code	Yes	Visit https://andrewzh112.github.io/#expel for prompts and demos, and https://github.com/LeapLabTHU/ExpeL for code.
Open Datasets	Yes	The experiments are designed based on four text-based benchmarks: HotpotQA (Yang et al. 2018), a knowledge-intensive dataset... ALFWorld and WebShop (Shridhar et al. 2021; Yao et al. 2022) that require the agent... and FEVER (Thorne et al. 2018), that focuses on fact verification tasks...
Dataset Splits	Yes	All experiments use four-fold validation, and we report the mean and standard error over the folds.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running its experiments. It mentions using LLMs like gpt-3.5-turbo-0613 and gpt-4-0613, which are accessed via APIs, implying the hardware is external to the authors' direct control and not specified.
Software Dependencies	Yes	All agents, including ExpeL, used gpt-3.5-turbo-0613 when performing actions during evaluation. We use gpt-4-0613 for adapting the HotpotQA insights into FEVER insights.
Experiment Setup	Yes	All experiments use four-fold validation, and we report the mean and standard error over the folds. Following ReAct, for all environments, we use success rate as the evaluation metric: exact matching for HotpotQA and FEVER, completing the task in time for ALFWorld, and purchasing the item that matches all attributes for WebShop. All agents, including ExpeL, used gpt-3.5-turbo-0613 when performing actions during evaluation. All text generations were done with temperature 0 and greedy decoding.
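
The reporting scheme in the Experiment Setup row (exact-match success rate per fold, then mean and standard error over four folds) can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names, the case-insensitive normalisation in `exact_match`, the use of the sample standard deviation (n - 1 denominator), and the per-fold scores are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def exact_match(prediction: str, answer: str) -> bool:
    # Assumed normalisation: trim whitespace and ignore case.
    return prediction.strip().lower() == answer.strip().lower()


def success_rate(predictions: list[str], answers: list[str]) -> float:
    # Fraction of tasks whose prediction exactly matches the reference answer.
    hits = sum(exact_match(p, a) for p, a in zip(predictions, answers))
    return hits / len(answers)


def mean_and_standard_error(fold_scores: list[float]) -> tuple[float, float]:
    # Standard error of the mean: sample standard deviation / sqrt(number of folds).
    n = len(fold_scores)
    return mean(fold_scores), stdev(fold_scores) / math.sqrt(n)


# Made-up success rates for four folds (illustrative only, not results from the paper).
fold_scores = [0.25, 0.5, 0.5, 0.75]
avg, sem = mean_and_standard_error(fold_scores)
print(f"success rate: {avg:.3f} +/- {sem:.3f}")
```

With only four folds the standard error is sensitive to the choice between the sample and population standard deviation, so a reproduction would need to check which one the released code uses.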