Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models
Authors: Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, Yu-Xiong Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluation across diverse domains, including programming, interactive question-answering (QA), web navigation, and math, validates the effectiveness and generality of LATS in decision-making while maintaining competitive or improved reasoning performance. Notably, LATS achieves state-of-the-art pass@1 accuracy (92.7%) for programming on HumanEval with GPT-4 and demonstrates gradient-free performance (average score of 75.9) comparable to gradient-based fine-tuning for web navigation on WebShop with GPT-3.5. |
| Researcher Affiliation | Collaboration | ¹University of Illinois Urbana-Champaign. ²Lapis Labs. |
| Pseudocode | Yes | Alg. 1 shows the pseudocode of our algorithm LATS. Nodes are stored explicitly in the memory. (See the node-bookkeeping sketch after the table.) |
| Open Source Code | Yes | Code can be found at https://github.com/lapisrocks/LanguageAgentTreeSearch. |
| Open Datasets | Yes | HotPotQA (Yang et al., 2018), a multi-hop question-answering benchmark that requires retrieval over two or more Wikipedia passages. |
| Dataset Splits | No | The paper specifies the evaluation subsets used (e.g., a randomly selected subset of 100 questions for HotPotQA, all 164 problems for HumanEval, 50 environments for WebShop, and 50 games for Game of 24), but it does not give explicit train/validation splits or percentages; since the method uses pre-trained LMs and evaluates a prompting strategy, no training splits are defined. |
| Hardware Specification | No | The paper mentions "NVIDIA GPUs at NCSA Delta" but does not specify the model numbers, quantities, or other detailed hardware specifications (CPU, memory, etc.) needed for reproduction. |
| Software Dependencies | No | The paper does not state software dependencies with version numbers, such as the Python version, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries. |
| Experiment Setup | Yes | Unless otherwise specified, in all experiments, we set the number of sampled nodes to n = 5 and the exploration weight to w = 1. We use a self-consistency weight of λ = 0.5 for HotPotQA and Game of 24, and λ = 0.8 for Programming and WebShop. (See the configuration sketch after the table.) |
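
The pseudocode row above notes that LATS keeps its search-tree nodes explicitly in memory and searches with an MCTS-style procedure. As a rough illustration only, here is a minimal Python sketch of that bookkeeping together with a standard UCT selection rule using the reported exploration weight w = 1; the `Node`, `uct_score`, and `select_child` names and fields are our assumptions, not the authors' exact Alg. 1.

```python
import math

class Node:
    """Minimal MCTS node, stored explicitly in memory as Alg. 1 describes."""

    def __init__(self, state, parent=None):
        self.state = state        # partial trajectory: prompt plus actions so far
        self.parent = parent
        self.children = []        # the whole tree is retained in memory
        self.visits = 0
        self.value_sum = 0.0      # accumulated value estimates from backpropagation

    @property
    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def uct_score(child, parent_visits, w=1.0):
    """Standard UCT: exploitation term plus w-weighted exploration bonus."""
    if child.visits == 0:
        return float("inf")       # visit unvisited children first
    explore = w * math.sqrt(math.log(parent_visits) / child.visits)
    return child.value + explore


def select_child(node, w=1.0):
    """Select the child maximizing UCT; the paper reports w = 1."""
    return max(node.children, key=lambda c: uct_score(c, node.visits, w))
```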
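
The experiment-setup row gives concrete hyperparameters. Since the paper calls λ a "self-consistency weight", a plausible reading is that each node's value blends an LM-judged score with a self-consistency score; the blend below is our assumption for illustration, not a formula quoted from the paper, and only the constant values are taken from the quoted setup.

```python
# Reported settings: n = 5 sampled children per expansion, exploration
# weight w = 1, and a task-dependent self-consistency weight lambda.
N_SAMPLED = 5
EXPLORATION_W = 1.0
SELF_CONSISTENCY_WEIGHT = {
    "hotpotqa": 0.5,
    "game_of_24": 0.5,
    "programming": 0.8,
    "webshop": 0.8,
}

def node_value(lm_score: float, sc_score: float, lam: float) -> float:
    """Assumed blend: lam weights self-consistency, (1 - lam) the LM score."""
    return lam * sc_score + (1.0 - lam) * lm_score
```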