ReAct: Synergizing Reasoning and Acting in Language Models
Authors: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, Yuan Cao
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical evaluations of ReAct and state-of-the-art baselines on four diverse benchmarks: question answering (HotpotQA, Yang et al., 2018), fact verification (Fever, Thorne et al., 2018), text-based game (ALFWorld, Shridhar et al., 2020b), and webpage navigation (WebShop, Yao et al., 2022). |
| Researcher Affiliation | Collaboration | Shunyu Yao*¹, Jeffrey Zhao², Dian Yu², Nan Du², Izhak Shafran², Karthik Narasimhan¹, Yuan Cao² — ¹Department of Computer Science, Princeton University; ²Google Research, Brain Team |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page with code: https://react-lm.github.io/. The paper also states: "The code for these experiments are at https://react-lm.github.io/." |
| Open Datasets | Yes | We consider two datasets challenging knowledge retrieval and reasoning: (1) HotpotQA (Yang et al., 2018), a multi-hop question answering benchmark... and (2) FEVER (Thorne et al., 2018), a fact verification benchmark... We also test ReAct on two language-based interactive decision-making tasks, ALFWorld and WebShop, both of which feature complex environments... ALFWorld (Shridhar et al., 2020b)... WebShop (Yao et al., 2022)... |
| Dataset Splits | No | For HotpotQA and FEVER, we randomly select 6 and 3 cases from the training set and manually compose ReAct-format trajectories to use as few-shot exemplars in the prompts. The paper does not specify a separate validation dataset split with percentages or counts for general model training/selection. |
| Hardware Specification | No | The paper mentions using PaLM-540B and GPT-3 (text-davinci-002) as the base models but does not provide specific hardware details (e.g., GPU models, CPU types, memory) of the machines used to run the experiments. The reproducibility statement notes PaLM is not an openly accessible model. |
| Software Dependencies | No | The paper mentions using specific large language models (PaLM-540B, GPT-3 text-davinci-002) but does not provide version numbers for ancillary software components, libraries, or frameworks used for implementation (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We set 7 and 5 steps for HotpotQA and FEVER respectively as we find more steps will not improve ReAct performance. We also build a self-consistency baseline (CoT-SC) by sampling 21 CoT trajectories with decoding temperature 0.7 during inference. For all finetuning we use a batch size of 64. On PaLM-8B, we finetune ReAct and Act methods for 4,000 steps and Standard and CoT methods for 2,000 steps. |
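
To make the inference-time settings in the Experiment Setup row concrete, here is a minimal Python sketch of how the CoT-SC baseline (21 sampled trajectories at temperature 0.7, majority vote) and the per-task ReAct step caps (7 for HotpotQA, 5 for FEVER) could be wired together. The `call_llm`, `parse_answer`, and `react_step` helpers are hypothetical placeholders for illustration, not the authors' released code.

```python
from collections import Counter

# Hyperparameters quoted from the paper's setup.
COT_SC_SAMPLES = 21                               # CoT trajectories sampled for self-consistency
COT_SC_TEMPERATURE = 0.7                          # decoding temperature during sampling
MAX_REACT_STEPS = {"hotpotqa": 7, "fever": 5}     # step limits per task

def cot_self_consistency(question, call_llm, parse_answer):
    """CoT-SC baseline: sample several chain-of-thought trajectories and majority-vote the answers."""
    answers = []
    for _ in range(COT_SC_SAMPLES):
        trajectory = call_llm(question, temperature=COT_SC_TEMPERATURE)  # hypothetical LLM call
        answers.append(parse_answer(trajectory))                         # hypothetical answer extractor
    return Counter(answers).most_common(1)[0][0]

def react_episode(question, task, react_step):
    """ReAct loop: alternate thought/action/observation until an answer or the step cap is reached."""
    context = question
    for _ in range(MAX_REACT_STEPS[task]):
        context, answer = react_step(context)  # hypothetical single thought-action-observation step
        if answer is not None:
            return answer
    return None  # no answer within the step budget
```

The finetuning details in the same row (batch size 64; 4,000 steps for ReAct/Act and 2,000 steps for Standard/CoT on PaLM-8B) are training-time settings and are not reflected in this inference-time sketch.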