ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search

Authors: Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor Bursztyn, Ryan A. Rossi, Somdeb Sarkhel, Chao Zhang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple tool-use and reasoning tasks demonstrate that ToolChain* efficiently balances exploration and exploitation within an expansive action space. It outperforms state-of-the-art baselines on planning and reasoning tasks by 3.1% and 3.5% on average while requiring 7.35x and 2.31x less time, respectively.
Researcher Affiliation | Collaboration | Yuchen Zhuang¹, Xiang Chen², Tong Yu², Saayan Mitra², Victor Bursztyn², Ryan A. Rossi², Somdeb Sarkhel², Chao Zhang¹ (¹Georgia Institute of Technology, ²Adobe Research); yczhuang@gatech.edu, {xiangche, tyu, smitra}@adobe.com, {soaresbu, ryrossi, sarkhel}@adobe.com, chaozhang@gatech.edu
Pseudocode | Yes | Algorithm 1: ToolChain*.
Open Source Code | No | Our code repository is presently undergoing an internal review within the company for public release. Upon receiving approval, we will make both the code and data available on GitHub.
Open Datasets | Yes | We evaluate ToolChain* on four tool-use environments in ToolBench (Xu et al., 2023) and one reasoning task in GSM8K (Cobbe et al., 2021).
Dataset Splits | No | The paper mentions using training data for fine-tuning LLMs but does not provide specific train/validation/test splits (percentages, counts, or predefined citations) for the datasets used in its main experiments.
Hardware Specification | Yes | All experiments are conducted on CPU: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz and GPU: NVIDIA A100 SXM4 80 GB using Python 3.8.13.
Software Dependencies | Yes | All experiments are conducted on CPU: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz and GPU: NVIDIA A100 SXM4 80 GB using Python 3.8.13. We chose the GPT-3.5-turbo engine for ChatGPT and the GPT-4 engine for GPT-4 when structuring the LLM-based agent. We use 8 NVIDIA A100 SXM4 80 GB GPUs and FastChat (Zheng et al., 2023) to tune the LLaMA-2 7B and 13B models on the training data discussed in Appendix F.
Experiment Setup | Yes | The maximum length for generated solutions is set to 512, and the temperature is set to 1 for flexibility in the self-consistency frequency function g_{t,1} (Section 3.2). For LLaMA-2 experiments, the maximum length for generated solutions is set as 256 and the temperature is set to 1. For ToolChain*, we set the weights of geometric means between heuristic and non-heuristic parts in cumulative and future costs as α = β = 0.5 by default. The number of potential actions in self-consistency frequency is set as k = 10 and the maximum step limit is set as T = 20 by default.
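To make the setup above concrete, here is a minimal, hypothetical sketch of an A*-style frontier search over an action tree, with the cumulative cost g and future cost h each formed as a geometric mean of a heuristic and a non-heuristic component weighted by α and β (0.5 by default), and a step budget T. The helper callables `expand`, `cumulative`, `future`, and `is_goal` are illustrative stand-ins, not the paper's actual implementation:

```python
import heapq


def combine(heuristic, non_heuristic, weight):
    """Geometric mean of the two cost components, weighted by `weight`
    (mirroring the alpha/beta = 0.5 defaults in the setup above)."""
    return heuristic ** weight * non_heuristic ** (1 - weight)


def toolchain_astar(root, expand, cumulative, future,
                    alpha=0.5, beta=0.5, is_goal=lambda n: False,
                    max_steps=20):
    """A*-style search sketch: repeatedly pop the frontier node with the
    lowest f = g + h, expand it, and score each child.

    `expand(node)` yields child actions; `cumulative(node)` and
    `future(node)` each return a (heuristic, non_heuristic) pair of
    positive scores for g and h respectively (hypothetical interfaces).
    """
    frontier = [(0.0, 0, root)]  # (f-score, tie-breaker, node)
    counter = 0                  # tie-breaker keeps heap comparisons on floats
    for _ in range(max_steps):   # step budget, like T = 20 in the paper
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)
        if is_goal(node):
            return node
        for child in expand(node):
            g = combine(*cumulative(child), alpha)  # cumulative cost
            h = combine(*future(child), beta)       # future-cost estimate
            counter += 1
            heapq.heappush(frontier, (g + h, counter, child))
    return None  # budget exhausted or frontier empty
```

With α = β = 0.5 the heuristic and non-heuristic parts contribute equally to each cost, which is the default balance between exploration and exploitation described in the setup row.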