Large Language Models as Tool Makers

Authors: Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach across various complex reasoning tasks, including Big-Bench tasks. With GPT-4 as the tool maker and GPT-3.5 as the tool user, LATM demonstrates performance equivalent to using GPT-4 for both roles, but with a significantly reduced inference cost. The codebase can be found at https://github.com/ctlllll/LLM-ToolMaker.
Researcher Affiliation | Collaboration | Tianle Cai (1,2), Xuezhi Wang (1), Tengyu Ma (1,3), Xinyun Chen (1), Denny Zhou (1); affiliations: 1 Google DeepMind, 2 Princeton University, 3 Stanford University
Pseudocode | Yes | Appendix D, 'Wrapped Tools', provides Python function snippets (e.g., 'def find_order(objects, constraints):') that serve as structured pseudocode or algorithm blocks for the described tools (a hedged sketch of such a wrapped tool is given after this table).
Open Source Code | Yes | The codebase can be found at https://github.com/ctlllll/LLM-ToolMaker.
Open Datasets | Yes | We evaluate our approach on six datasets from diverse domains, including Logical Deduction, Tracking Shuffled Objects, Dyck Language, Word Sorting, Chinese Remainder Theorem, and Scheduling Meeting. The first five datasets are sourced from Big-Bench (Srivastava et al., 2022).
Dataset Splits | Yes | We divide each dataset into training, validation, and test sets, containing 3, 3, and 240 instances, respectively.
Hardware Specification | No | The paper mentions using GPT-4 and GPT-3.5 Turbo models via API calls, but it does not specify the underlying hardware (e.g., GPU models, CPU types, or cloud instance specifications) used by the authors to run their experiments.
Software Dependencies | No | The paper mentions a 'Python utility function' and the 'Chat Completion API' / 'standard Completion API' for GPT models, but it does not specify version numbers for Python, any libraries (e.g., PyTorch, TensorFlow), or the API clients themselves to ensure reproducibility of the software environment.
Experiment Setup | Yes | Model settings. During the tool-making stage, we set the temperature to 0.3 to introduce randomness to the generation process, allowing for retries if necessary. For this stage, we conduct experiments using GPT-4 and GPT-3.5 Turbo models with the Chat Completion API, always appending the response to the chat history to create an interactive experience. In the tool-using stage, the LLM API call is made only once, and we also perform ablation studies on GPT-3-type models with the standard Completion API. When using the tools, we consistently set the temperature to 0.0. We set the maximal retry times to be 3 for the tool-proposing and tool-verification stages.
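The quoted model settings amount to a small amount of client-side configuration. The sketch below is an illustrative reconstruction, not the authors' code: it assumes the openai Python package's chat-completions interface, and the model names, prompt text, and placeholder verification check are hypothetical.

    # Illustrative sketch of the quoted settings (not the authors' code):
    # tool making at temperature 0.3 with up to 3 retries, tool using at
    # temperature 0.0 with a single API call. Prompts and the verification
    # check are placeholders.
    from openai import OpenAI

    client = OpenAI()
    MAX_RETRIES = 3  # "maximal retry times" for tool proposing / verification

    def make_tool(task_description):
        """Tool-making stage: GPT-4 via the Chat Completion API, temperature 0.3."""
        for _ in range(MAX_RETRIES):
            response = client.chat.completions.create(
                model="gpt-4",
                temperature=0.3,  # randomness so a retry can produce a different tool
                messages=[{"role": "user", "content":
                           "Write a generic Python utility function that solves tasks like:\n"
                           + task_description}],
            )
            tool_code = response.choices[0].message.content
            if "def " in tool_code:  # stand-in for the tool-verification stage
                return tool_code
        raise RuntimeError("Tool making failed after the maximal number of retries.")

    def use_tool(tool_code, question):
        """Tool-using stage: GPT-3.5 Turbo, temperature 0.0, a single API call."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0.0,
            messages=[{"role": "user", "content":
                       tool_code + "\n\nUse the function above to answer:\n" + question}],
        )
        return response.choices[0].message.content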
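The Pseudocode row above points to the wrapped tools of Appendix D. The snippet below is a minimal sketch of what such a wrapped tool for the Logical Deduction task might look like; only the signature find_order(objects, constraints) is taken from the paper, while the constraint representation (callables over a position map) and the brute-force search are assumptions made for illustration.

    # Minimal sketch of a wrapped tool in the style of Appendix D.
    # Only the signature find_order(objects, constraints) comes from the paper;
    # the constraint format and the brute-force search are illustrative assumptions.
    from itertools import permutations

    def find_order(objects, constraints):
        """Return an ordering of the objects that satisfies every constraint.

        Each constraint is assumed to be a callable taking a dict that maps
        an object to its position (0 = leftmost) and returning True/False.
        """
        for ordering in permutations(objects):
            positions = {obj: idx for idx, obj in enumerate(ordering)}
            if all(constraint(positions) for constraint in constraints):
                return list(ordering)
        return None  # no ordering satisfies all constraints

    # Toy Logical Deduction instance: "the red book is left of the green book;
    # the blue book is on the far right."
    books = ["red", "green", "blue"]
    rules = [
        lambda p: p["red"] < p["green"],
        lambda p: p["blue"] == len(books) - 1,
    ]
    print(find_order(books, rules))  # ['red', 'green', 'blue']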