ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search

Authors: Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor Bursztyn, Ryan A. Rossi, Somdeb Sarkhel, Chao Zhang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple tool-use and reasoning tasks demonstrate that ToolChain* efficiently balances exploration and exploitation within an expansive action space. It outperforms state-of-the-art baselines on planning and reasoning tasks by 3.1% and 3.5% on average while requiring 7.35x and 2.31x less time, respectively.
Researcher Affiliation | Collaboration | Yuchen Zhuang¹, Xiang Chen², Tong Yu², Saayan Mitra², Victor Bursztyn², Ryan A. Rossi², Somdeb Sarkhel², Chao Zhang¹ (¹Georgia Institute of Technology, ²Adobe Research); yczhuang@gatech.edu, {xiangche, tyu, smitra}@adobe.com, {soaresbu, ryrossi, sarkhel}@adobe.com, chaozhang@gatech.edu
Pseudocode | Yes | Algorithm 1: ToolChain*.
Open Source Code | No | Our code repository is presently undergoing an internal review within the company for public release. Upon receiving approval, we will make both the code and data available on GitHub.
Open Datasets | Yes | We evaluate ToolChain* on four tool-use environments in ToolBench (Xu et al., 2023) and one reasoning task in GSM8K (Cobbe et al., 2021).
Dataset Splits | No | The paper mentions using training data for fine-tuning LLMs but does not provide specific train/validation/test splits (percentages, counts, or predefined citations) for the datasets used in its main experiments.
Hardware Specification | Yes | All experiments are conducted on CPU: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz and GPU: NVIDIA A100 SXM4 80 GB using Python 3.8.13.
Software Dependencies | Yes | All experiments are conducted on CPU: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz and GPU: NVIDIA A100 SXM4 80 GB using Python 3.8.13. We chose the GPT-3.5-turbo engine for ChatGPT and the GPT-4 engine for GPT-4 when structuring the LLM-based agent. We use 8 NVIDIA A100 SXM4 80 GB GPUs and FastChat (Zheng et al., 2023) to tune the LLaMA-2 7B and 13B models on the training data discussed in Appendix F.
Experiment Setup | Yes | The maximum length for generated solutions is set to 512, and the temperature is set to 1 for flexibility in the self-consistency frequency function g_{t,1} (Section 3.2). For LLaMA-2 experiments, the maximum length for generated solutions is set as 256 and the temperature is set to 1. For ToolChain*, we set the weights of geometric means between heuristic and non-heuristic parts in cumulative and future costs as α = β = 0.5 by default. The number of potential actions in self-consistency frequency is set as k = 10 and the maximum step limit is set as T = 20 by default.
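To make the setup above concrete, here is a minimal, hypothetical sketch of an A*-style frontier search over an action tree, with the cumulative cost g and future cost h each formed as a geometric mean of a heuristic and a non-heuristic component weighted by α and β (0.5 by default), and a step budget T. The helper callables `expand`, `cumulative`, `future`, and `is_goal` are illustrative stand-ins, not the paper's actual implementation:

```python
import heapq


def combine(heuristic, non_heuristic, weight):
    """Geometric mean of the two cost components, weighted by `weight`
    (mirroring the alpha/beta = 0.5 defaults in the setup above)."""
    return heuristic ** weight * non_heuristic ** (1 - weight)


def toolchain_astar(root, expand, cumulative, future,
                    alpha=0.5, beta=0.5, is_goal=lambda n: False,
                    max_steps=20):
    """A*-style search sketch: repeatedly pop the frontier node with the
    lowest f = g + h, expand it, and score each child.

    `expand(node)` yields child actions; `cumulative(node)` and
    `future(node)` each return a (heuristic, non_heuristic) pair of
    positive scores for g and h respectively (hypothetical interfaces).
    """
    frontier = [(0.0, 0, root)]  # (f-score, tie-breaker, node)
    counter = 0                  # tie-breaker keeps heap comparisons on floats
    for _ in range(max_steps):   # step budget, like T = 20 in the paper
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)
        if is_goal(node):
            return node
        for child in expand(node):
            g = combine(*cumulative(child), alpha)  # cumulative cost
            h = combine(*future(child), beta)       # future-cost estimate
            counter += 1
            heapq.heappush(frontier, (g + h, counter, child))
    return None  # budget exhausted or frontier empty
```

With α = β = 0.5 the heuristic and non-heuristic parts contribute equally to each cost, which is the default balance between exploration and exploitation described in the setup row.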