Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models
Authors: Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, Yu-Xiong Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluation across diverse domains, including programming, interactive question-answering (QA), web navigation, and math, validates the effectiveness and generality of LATS in decision-making while maintaining competitive or improved reasoning performance. Notably, LATS achieves state-of-the-art pass@1 accuracy (92.7%) for programming on HumanEval with GPT-4 and demonstrates gradient-free performance (average score of 75.9) comparable to gradient-based fine-tuning for web navigation on WebShop with GPT-3.5. |
| Researcher Affiliation | Collaboration | ¹University of Illinois Urbana-Champaign. ²Lapis Labs. |
| Pseudocode | Yes | Alg. 1 shows the pseudocode of our algorithm LATS. Nodes are stored explicitly in the memory. (See the node-bookkeeping sketch after the table.) |
| Open Source Code | Yes | Code can be found at https://github.com/lapisrocks/LanguageAgentTreeSearch. |
| Open Datasets | Yes | HotPotQA (Yang et al., 2018), a multi-hop question-answering benchmark that requires retrieval over two or more Wikipedia passages. |
| Dataset Splits | No | The paper specifies the evaluation subsets used (e.g., a randomly selected subset of 100 questions for HotPotQA, all 164 problems for HumanEval, 50 environments for WebShop, and 50 games for Game of 24), but it does not give explicit train/validation splits or percentages; since the method uses pre-trained LMs and evaluates a prompting strategy, no training splits are defined. |
| Hardware Specification | No | The paper mentions "NVIDIA GPUs at NCSA Delta" but does not specify the model numbers, quantities, or other detailed hardware specifications (CPU, memory, etc.) needed for reproduction. |
| Software Dependencies | No | The paper does not state software dependencies with version numbers, such as the Python version, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries. |
| Experiment Setup | Yes | Unless otherwise specified, in all experiments, we set the number of sampled nodes to n = 5 and the exploration weight to w = 1. We use a self-consistency weight of λ = 0.5 for HotPotQA and Game of 24, and λ = 0.8 for Programming and WebShop. (See the configuration sketch after the table.) |
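
The pseudocode row above notes that LATS keeps its search-tree nodes explicitly in memory and searches with an MCTS-style procedure. As a rough illustration only, here is a minimal Python sketch of that bookkeeping together with a standard UCT selection rule using the reported exploration weight w = 1; the `Node`, `uct_score`, and `select_child` names and fields are our assumptions, not the authors' exact Alg. 1.

```python
import math

class Node:
    """Minimal MCTS node, stored explicitly in memory as Alg. 1 describes."""

    def __init__(self, state, parent=None):
        self.state = state        # partial trajectory: prompt plus actions so far
        self.parent = parent
        self.children = []        # the whole tree is retained in memory
        self.visits = 0
        self.value_sum = 0.0      # accumulated value estimates from backpropagation

    @property
    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def uct_score(child, parent_visits, w=1.0):
    """Standard UCT: exploitation term plus w-weighted exploration bonus."""
    if child.visits == 0:
        return float("inf")       # visit unvisited children first
    explore = w * math.sqrt(math.log(parent_visits) / child.visits)
    return child.value + explore


def select_child(node, w=1.0):
    """Select the child maximizing UCT; the paper reports w = 1."""
    return max(node.children, key=lambda c: uct_score(c, node.visits, w))
```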
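
The experiment-setup row gives concrete hyperparameters. Since the paper calls λ a "self-consistency weight", a plausible reading is that each node's value blends an LM-judged score with a self-consistency score; the blend below is our assumption for illustration, not a formula quoted from the paper, and only the constant values are taken from the quoted setup.

```python
# Reported settings: n = 5 sampled children per expansion, exploration
# weight w = 1, and a task-dependent self-consistency weight lambda.
N_SAMPLED = 5
EXPLORATION_W = 1.0
SELF_CONSISTENCY_WEIGHT = {
    "hotpotqa": 0.5,
    "game_of_24": 0.5,
    "programming": 0.8,
    "webshop": 0.8,
}

def node_value(lm_score: float, sc_score: float, lam: float) -> float:
    """Assumed blend: lam weights self-consistency, (1 - lam) the LM score."""
    return lam * sc_score + (1.0 - lam) * lm_score
```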