ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search
Authors: Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor Bursztyn, Ryan A. Rossi, Somdeb Sarkhel, Chao Zhang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple tool-use and reasoning tasks demonstrate that ToolChain* efficiently balances exploration and exploitation within an expansive action space. It outperforms state-of-the-art baselines on planning and reasoning tasks by 3.1% and 3.5% on average while requiring 7.35x and 2.31x less time, respectively. |
| Researcher Affiliation | Collaboration | Yuchen Zhuang1, Xiang Chen2, Tong Yu2, Saayan Mitra2, Victor Bursztyn2, Ryan A. Rossi2, Somdeb Sarkhel2, Chao Zhang1; Georgia Institute of Technology1, Adobe Research2; yczhuang@gatech.edu, {xiangche, tyu, smitra}@adobe.com, {soaresbu, ryrossi, sarkhel}@adobe.com, chaozhang@gatech.edu |
| Pseudocode | Yes | Algorithm 1: ToolChain*. |
| Open Source Code | No | Our code repository is presently undergoing an internal review within the company for public release. Upon receiving approval, we will make both the code and data available on GitHub. |
| Open Datasets | Yes | We evaluate ToolChain* on four tool-use environments in ToolBench (Xu et al., 2023) and one reasoning task in GSM8K (Cobbe et al., 2021). |
| Dataset Splits | No | The paper mentions using training data for fine-tuning LLMs but does not provide specific train/validation/test splits (percentages, counts, or predefined citations) for the datasets used in its main experiments. |
| Hardware Specification | Yes | All experiments are conducted on CPU: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz and GPU: NVIDIA A100 SXM4 80 GB using Python 3.8.13. |
| Software Dependencies | Yes | All experiments are conducted on CPU: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz and GPU: NVIDIA A100 SXM4 80 GB using Python 3.8.13. We chose the GPT-3.5-turbo engine for ChatGPT and the GPT-4 engine for GPT-4 when structuring the LLM-based agent. We use 8 NVIDIA A100 SXM4 80 GB GPUs and FastChat (Zheng et al., 2023) to tune the LLaMA-2 7B and 13B models on the training data discussed in Appendix F. |
| Experiment Setup | Yes | The maximum length for generated solutions is set to 512, and the temperature is set to 1 for flexibility in the self-consistency frequency function g_{t,1} (Section 3.2). For LLaMA-2 experiments, the maximum length for generated solutions is set to 256 and the temperature is set to 1. For ToolChain*, we set the weights of the geometric means between heuristic and non-heuristic parts in the cumulative and future costs as α = β = 0.5 by default. The number of potential actions in self-consistency frequency is set as k = 10 and the maximum step limit is set as T = 20 by default. |
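The setup row describes an A*-style scoring scheme in which the cumulative cost g(n) and the future-cost estimate h(n) each mix a task-specific heuristic score with a non-heuristic (model-derived) score via a weighted geometric mean, with α = β = 0.5 by default. The sketch below is a minimal illustration of that combination only; the function names and score inputs are hypothetical placeholders, not the paper's actual implementation.

```python
def geometric_mix(heuristic_score: float, model_score: float, weight: float) -> float:
    """Weighted geometric mean of two positive scores.

    With weight = 0.5 this is the plain geometric mean, matching the
    paper's default alpha = beta = 0.5 setting.
    """
    return (heuristic_score ** weight) * (model_score ** (1.0 - weight))


def node_cost(g_heur: float, g_model: float,
              h_heur: float, h_model: float,
              alpha: float = 0.5, beta: float = 0.5) -> float:
    """A*-style total cost f(n) = g(n) + h(n).

    g(n): cumulative cost, mixing heuristic and non-heuristic parts by alpha.
    h(n): future-cost estimate, mixed analogously by beta.
    (Placeholder inputs; the paper's actual scores come from, e.g., the
    self-consistency frequency function g_{t,1}.)
    """
    g = geometric_mix(g_heur, g_model, alpha)
    h = geometric_mix(h_heur, h_model, beta)
    return g + h
```

In an A* search over the action space, nodes would be expanded in ascending order of this f(n); the geometric mean keeps the combined cost low only when both the heuristic and model-derived parts agree.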