ReAct: Synergizing Reasoning and Acting in Language Models
Authors: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, Yuan Cao
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical evaluations of ReAct and state-of-the-art baselines on four diverse benchmarks: question answering (HotpotQA, Yang et al., 2018), fact verification (Fever, Thorne et al., 2018), text-based game (ALFWorld, Shridhar et al., 2020b), and webpage navigation (WebShop, Yao et al., 2022). |
| Researcher Affiliation | Collaboration | Shunyu Yao*¹, Jeffrey Zhao², Dian Yu², Nan Du², Izhak Shafran², Karthik Narasimhan¹, Yuan Cao² — ¹Department of Computer Science, Princeton University; ²Google Research, Brain Team |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page with code: https://react-lm.github.io/. The paper also states: "The code for these experiments are at https://react-lm.github.io/." |
| Open Datasets | Yes | We consider two datasets challenging knowledge retrieval and reasoning: (1) HotpotQA (Yang et al., 2018), a multi-hop question answering benchmark... and (2) FEVER (Thorne et al., 2018), a fact verification benchmark... We also test ReAct on two language-based interactive decision-making tasks, ALFWorld and WebShop, both of which feature complex environments... ALFWorld (Shridhar et al., 2020b)... WebShop (Yao et al., 2022)... |
| Dataset Splits | No | For HotpotQA and FEVER, we randomly select 6 and 3 cases from the training set and manually compose ReAct-format trajectories to use as few-shot exemplars in the prompts. The paper does not specify a separate validation dataset split with percentages or counts for general model training/selection. |
| Hardware Specification | No | The paper mentions using PaLM-540B and GPT-3 (text-davinci-002) as the base models but does not provide specific hardware details (e.g., GPU models, CPU types, memory) of the machines used to run the experiments. The reproducibility statement notes PaLM is not an openly accessible model. |
| Software Dependencies | No | The paper mentions using specific large language models (PaLM-540B, GPT-3 text-davinci-002) but does not provide version numbers for ancillary software components, libraries, or frameworks used for implementation (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We set 7 and 5 steps for HotpotQA and FEVER respectively as we find more steps will not improve ReAct performance. We also build a self-consistency baseline (CoT-SC) by sampling 21 CoT trajectories with decoding temperature 0.7 during inference. For all finetuning we use a batch size of 64. On PaLM-8B, we finetune ReAct and Act methods for 4,000 steps and Standard and CoT methods for 2,000 steps. |
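
To make the inference-time settings in the Experiment Setup row concrete, here is a minimal Python sketch of how the CoT-SC baseline (21 sampled trajectories at temperature 0.7, majority vote) and the per-task ReAct step caps (7 for HotpotQA, 5 for FEVER) could be wired together. The `call_llm`, `parse_answer`, and `react_step` helpers are hypothetical placeholders for illustration, not the authors' released code.

```python
from collections import Counter

# Hyperparameters quoted from the paper's setup.
COT_SC_SAMPLES = 21                               # CoT trajectories sampled for self-consistency
COT_SC_TEMPERATURE = 0.7                          # decoding temperature during sampling
MAX_REACT_STEPS = {"hotpotqa": 7, "fever": 5}     # step limits per task

def cot_self_consistency(question, call_llm, parse_answer):
    """CoT-SC baseline: sample several chain-of-thought trajectories and majority-vote the answers."""
    answers = []
    for _ in range(COT_SC_SAMPLES):
        trajectory = call_llm(question, temperature=COT_SC_TEMPERATURE)  # hypothetical LLM call
        answers.append(parse_answer(trajectory))                         # hypothetical answer extractor
    return Counter(answers).most_common(1)[0][0]

def react_episode(question, task, react_step):
    """ReAct loop: alternate thought/action/observation until an answer or the step cap is reached."""
    context = question
    for _ in range(MAX_REACT_STEPS[task]):
        context, answer = react_step(context)  # hypothetical single thought-action-observation step
        if answer is not None:
            return answer
    return None  # no answer within the step budget
```

The finetuning details in the same row (batch size 64; 4,000 steps for ReAct/Act and 2,000 steps for Standard/CoT on PaLM-8B) are training-time settings and are not reflected in this inference-time sketch.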