Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ExpeL: LLM Agents Are Experiential Learners
Authors: Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments. |
| Researcher Affiliation | Academia | Andrew Zhao1, Daniel Huang2, Quentin Xu2, Matthieu Lin2, Yong-Jin Liu2, Gao Huang1* 1 Department of Automation, BNRist, Tsinghua University 2 Department of Computer Science, BNRist, Tsinghua University EMAIL, EMAIL |
| Pseudocode | Yes | The pseudo-code can be found in Alg. 1. Pseudo-code for this process can be found in Alg. 2. a pseudo-code for this step can be found in Alg. 3. |
| Open Source Code | Yes | Visit https://andrewzh112.github.io/#expel for prompts and demos, and https://github.com/LeapLabTHU/ExpeL for code. |
| Open Datasets | Yes | The experiments are designed based on four text-based benchmarks: HotpotQA (Yang et al. 2018), a knowledge-intensive dataset... ALFWorld and WebShop (Shridhar et al. 2021; Yao et al. 2022) that require the agent... and FEVER (Thorne et al. 2018), that focuses on fact verification tasks... |
| Dataset Splits | Yes | All experiments use four-fold validation, and we report the mean and standard error over the folds. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running its experiments. It mentions using LLMs like gpt-3.5-turbo-0613 and gpt-4-0613, which are accessed via APIs, implying the hardware is external to the authors' direct control and not specified. |
| Software Dependencies | Yes | All agents, including ExpeL, used gpt-3.5-turbo-0613 when performing actions during evaluation. We use gpt-4-0613 for adapting the HotpotQA insights into FEVER insights. |
| Experiment Setup | Yes | All experiments use four-fold validation, and we report the mean and standard error over the folds. Following ReAct, for all environments, we use success rate as the evaluation metric: exact matching for HotpotQA and FEVER, completing the task in time for ALFWorld, and purchasing the item that matches all attributes for WebShop. All agents, including ExpeL, used gpt-3.5-turbo-0613 when performing actions during evaluation. All text generations were done with temperature 0 and greedy decoding. |
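
The evaluation protocol quoted in the table (exact-match success rate, reported as mean and standard error over four folds) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the whitespace and case normalization, and the fold scores are all hypothetical, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of folds.

```python
import math
from statistics import mean, stdev


def exact_match_rate(predictions, answers):
    """Fraction of predictions that equal the reference answer.

    Assumption: strings are compared after trimming whitespace and
    lowercasing; the paper's exact normalization is not stated here.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one example is required")
    hits = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return hits / len(answers)


def mean_and_standard_error(fold_scores):
    """Mean and standard error of per-fold success rates.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    n = len(fold_scores)
    if n < 2:
        raise ValueError("at least two folds are required")
    return mean(fold_scores), stdev(fold_scores) / math.sqrt(n)


if __name__ == "__main__":
    # Hypothetical per-fold success rates, for illustration only.
    scores = [0.25, 0.5, 0.5, 0.75]
    avg, se = mean_and_standard_error(scores)
    print(f"success rate: {avg:.3f} +/- {se:.3f}")
```

With only four folds the standard error is itself a rough estimate, so differences between agents that fall within one standard error should be read cautiously.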