Reflexion: language agents with verbal reinforcement learning
Authors: Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance. |
| Researcher Affiliation | Academia | Noah Shinn (Northeastern University, noahshinn024@gmail.com); Federico Cassano (Northeastern University, cassano.f@northeastern.edu); Ashwin Gopinath (MIT, agopi@mit.edu); Karthik Narasimhan (Princeton University, karthikn@princeton.edu); Shunyu Yao (Princeton University, shunyuy@princeton.edu) |
| Pseudocode | Yes | Algorithm 1: Reinforcement via self-reflection (a minimal sketch of this loop appears after the table). |
| Open Source Code | Yes | We release all code, demos, and datasets at https://github.com/noahshinn024/reflexion. |
| Open Datasets | Yes | We evaluate various natural language RL setups on decision-making, reasoning, and code generation tasks. Specifically, we challenge an agent to perform search-based question answering on HotPotQA [28], multi-step tasks in common household environments in AlfWorld [24], and code writing tasks in competition-like environments with interpreters and compilers in HumanEval [6], MBPP [2], and LeetcodeHard, a new benchmark. |
| Dataset Splits | No | The paper names the evaluation sets it uses (e.g., "134 AlfWorld environments", "100 HotPotQA questions") but does not specify how any dataset was split into training, validation, and test sets with percentages or counts; the datasets are used only for evaluation, not for learning/training splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific LLMs like GPT-3, GPT-3.5, GPT-4, and starchat-beta, and tools like MultiPL-E, but it does not provide version numbers for these or for other software dependencies such as programming languages or libraries. |
| Experiment Setup | Yes | For CoT implementations, we use 6-shot prompting; for ReAct, we use 2-shot prompting; and for self-reflection, we use 2-shot prompting. All examples can be found in the appendix. To avoid very long prompt windows that may exceed the maximum limit, we truncate the agent's memory to the last 3 self-reflections (experiences). Aside from the unit test suite component, the setup for the learning loop for a Reflexion programming agent is identical to the reasoning and decision-making agents, with a max memory limit of 1 experience. (See the prompt-construction sketch after the table.) |
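
The Pseudocode row refers to Algorithm 1 (Reinforcement via self-reflection). Below is a minimal sketch of that loop, assuming three LLM-backed components — an actor, an evaluator, and a self-reflection model — plus an episodic memory of verbal reflections. The function names and signatures are illustrative placeholders, not the API of the released repository.

```python
from typing import Callable, List, Tuple

def reflexion_loop(
    actor: Callable[[str, List[str]], str],             # generates a trajectory given task + reflections
    evaluator: Callable[[str, str], Tuple[bool, str]],  # returns (success, feedback) for a trajectory
    self_reflect: Callable[[str, str, str], str],       # turns (task, trajectory, feedback) into a verbal reflection
    task: str,
    max_trials: int = 5,
) -> str:
    """Hypothetical sketch of Algorithm 1: retry a task, accumulating verbal self-reflections."""
    memory: List[str] = []  # long-term memory of self-reflections
    trajectory = ""
    for _ in range(max_trials):
        # The actor conditions on the task and the stored reflections.
        trajectory = actor(task, memory)
        # The evaluator provides a feedback signal (binary success plus free-form feedback here).
        success, feedback = evaluator(task, trajectory)
        if success:
            break
        # Convert the failure and feedback into a verbal lesson and store it for the next trial.
        memory.append(self_reflect(task, trajectory, feedback))
    return trajectory
```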
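
The Experiment Setup row notes that memory is truncated to the last 3 self-reflections so prompts stay within the context limit. A small sketch of that truncation when assembling the actor prompt, under the same hypothetical interfaces as above (the few-shot example text and prompt layout are placeholders):

```python
MAX_REFLECTIONS = 3  # the paper truncates memory to the last 3 self-reflections

def build_prompt(few_shot_examples: str, task: str, memory: List[str]) -> str:
    """Assemble the actor prompt, keeping only the most recent reflections."""
    recent = memory[-MAX_REFLECTIONS:]  # drop older experiences to bound prompt length
    reflection_block = "\n".join(f"Reflection: {r}" for r in recent)
    return f"{few_shot_examples}\n\n{reflection_block}\n\nTask: {task}\n"
```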