ReAct: Synergizing Reasoning and Acting in Language Models

Authors: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, Yuan Cao

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct empirical evaluations of ReAct and state-of-the-art baselines on four diverse benchmarks: question answering (HotpotQA, Yang et al., 2018), fact verification (FEVER, Thorne et al., 2018), text-based game (ALFWorld, Shridhar et al., 2020b), and webpage navigation (WebShop, Yao et al., 2022). (A minimal sketch of the ReAct loop appears after this table.)
Researcher Affiliation | Collaboration | Shunyu Yao*,1, Jeffrey Zhao2, Dian Yu2, Nan Du2, Izhak Shafran2, Karthik Narasimhan1, Yuan Cao2. 1Department of Computer Science, Princeton University; 2Google Research, Brain team.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Project page with code: https://react-lm.github.io/, and: The code for these experiments is at https://react-lm.github.io/.
Open Datasets | Yes | We consider two datasets challenging knowledge retrieval and reasoning: (1) HotpotQA (Yang et al., 2018), a multi-hop question answering benchmark... and (2) FEVER (Thorne et al., 2018), a fact verification benchmark... We also test ReAct on two language-based interactive decision-making tasks, ALFWorld and WebShop, both of which feature complex environments... ALFWorld (Shridhar et al., 2020b)... WebShop (Yao et al., 2022)...
Dataset Splits | No | For HotpotQA and FEVER, we randomly select 6 and 3 cases from the training set and manually compose ReAct-format trajectories to use as few-shot exemplars in the prompts. The paper does not specify a separate validation split with percentages or counts for general model training/selection. (The prompt-assembly sketch after this table illustrates this exemplar format.)
Hardware Specification | No | The paper mentions using PaLM-540B and GPT-3 (text-davinci-002) as the base models but does not provide specific hardware details (e.g., GPU models, CPU types, memory) of the machines used to run the experiments. The reproducibility statement notes PaLM is not an openly accessible model.
Software Dependencies | No | The paper mentions specific large language models (PaLM-540B, GPT-3 text-davinci-002) but does not provide version numbers for ancillary software components, libraries, or frameworks used for implementation (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We set 7 and 5 steps for HotpotQA and FEVER respectively as we find more steps will not improve ReAct performance. We also build a self-consistency baseline (CoT-SC) by sampling 21 CoT trajectories with decoding temperature 0.7 during inference. For all finetuning we use a batch size of 64. On PaLM-8B, we finetune ReAct and Act methods for 4,000 steps and Standard and CoT methods for 2,000 steps. (A sketch of the CoT-SC sampling scheme appears after this table.)
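
Since the table repeatedly references the ReAct method, a minimal sketch of its thought/action/observation loop is given below. The `llm` and `wiki_env` functions are hypothetical stand-ins, and the action names (Search/Lookup/Finish) are assumed from the paper's HotpotQA setup; this is illustrative, not the authors' released code.

```python
# A minimal ReAct loop sketch: interleave free-form "Thought" steps with
# environment "Action" steps, feeding each "Observation" back into the prompt.
# llm() and wiki_env() are hypothetical stand-ins, not the authors' code.

import re

def llm(prompt: str) -> str:
    """Hypothetical LLM completion call (e.g., PaLM-540B in the paper)."""
    raise NotImplementedError

def wiki_env(action: str, argument: str) -> str:
    """Hypothetical Wikipedia-API environment handling Search/Lookup actions."""
    raise NotImplementedError

def react(question: str, exemplars: str, max_steps: int = 7) -> str:
    """Run up to max_steps thought/action/observation rounds
    (the paper caps HotpotQA at 7 steps and FEVER at 5)."""
    prompt = exemplars + f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        # The model continues with a free-form thought and one action.
        completion = llm(prompt + f"Thought {step}:")
        prompt += f"Thought {step}:{completion}\n"
        match = re.search(r"Action \d+: (\w+)\[(.*?)\]", completion)
        if match is None:
            continue
        act, arg = match.groups()
        if act == "Finish":               # terminal action carries the answer
            return arg
        observation = wiki_env(act, arg)  # Search[...] or Lookup[...]
        prompt += f"Observation {step}: {observation}\n"
    return ""  # no answer within the step budget
```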
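
The Dataset Splits row notes that the few-shot exemplars are hand-composed trajectories drawn from the training set (6 for HotpotQA, 3 for FEVER). A sketch of how such exemplars might be assembled into a prompt, with purely illustrative placeholder text rather than the authors' actual prompt:

```python
# Sketch of few-shot prompt assembly for ReAct. The exemplar text below is
# an illustrative placeholder; the paper's real exemplars are hand-written
# trajectories ending in a Finish[answer] action.

HOTPOTQA_EXEMPLARS = [
    """Question: ...
Thought 1: I need to search ...
Action 1: Search[...]
Observation 1: ...
Thought 2: ...
Action 2: Finish[...]""",
    # ... 5 more hand-composed trajectories ...
]

def build_prompt(exemplars: list[str], question: str) -> str:
    """Concatenate the exemplar trajectories and append the test question."""
    return "\n\n".join(exemplars) + f"\n\nQuestion: {question}\n"
```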
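
The Experiment Setup row describes the CoT-SC baseline: 21 sampled chain-of-thought trajectories at temperature 0.7, reduced to a single prediction by majority vote (self-consistency). A sketch assuming hypothetical `llm_sample` and `extract_answer` helpers:

```python
# Sketch of the CoT-SC (self-consistency) baseline: sample 21 CoT
# trajectories at decoding temperature 0.7 and majority-vote the answers.
# llm_sample() and extract_answer() are hypothetical helpers.

from collections import Counter

def llm_sample(prompt: str, temperature: float) -> str:
    """Hypothetical sampled LLM completion."""
    raise NotImplementedError

def extract_answer(trajectory: str) -> str:
    """Hypothetical parser pulling the final answer out of a CoT trace."""
    raise NotImplementedError

def cot_sc(prompt: str, n_samples: int = 21, temperature: float = 0.7) -> str:
    """Self-consistency: majority vote over sampled chain-of-thought answers."""
    answers = [extract_answer(llm_sample(prompt, temperature))
               for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```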