AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls

Authors: Yu Du, Fangyun Wei, Hongyang Zhang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across various datasets demonstrate the superiority of our AnyTool over strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization. For instance, AnyTool outperforms ToolLLM by +35.4% in terms of average pass rate on ToolBench. Code is available at https://github.com/dyabel/AnyTool.
Researcher Affiliation | Collaboration | Yu Du* (Tsinghua University), Fangyun Wei* (Microsoft Research Asia), Hongyang Zhang (University of Waterloo).
Pseudocode | No | The paper describes algorithms such as DFSDT and CoT but does not present them in formal pseudocode blocks or explicitly labeled "Algorithm" sections.
Open Source Code | Yes | Code is available at https://github.com/dyabel/AnyTool.
Open Datasets | Yes | We conduct experiments on two benchmarks: 1) ToolBench (Qin et al., 2023b); and 2) our own benchmark, termed AnyToolBench. ... To ensure that all queries in the benchmark, namely ToolBench (Qin et al., 2023b), are solvable using certain APIs from the API pool, we conduct a manual review of all queries. ... The process of creating AnyToolBench is detailed in Section A.8 of the appendix.
Dataset Splits | No | The paper does not explicitly provide training/validation splits (e.g., percentages, sample counts, or citations to predefined splits) needed to reproduce the data partitioning for its experiments.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory specifications, or cloud instance types used for running experiments.
Software Dependencies | No | The paper mentions software such as GPT-4, GPT-3.5, and ChatGLM but does not specify their version numbers or other ancillary software dependencies with versions.
Experiment Setup | Yes | For the solver implementing DFSDT, we set the maximum number of API calls to 10. Additionally, for our AnyTool, we establish a limit of 200,000 tokens for efficiency.
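The two limits reported in the experiment setup (at most 10 API calls for the DFSDT solver, and a 200,000-token budget for AnyTool as a whole) can be sketched as a shared budget object that a depth-first solver consults before each call. This is a minimal illustrative sketch, not code from the AnyTool repository; the names `Budget`, `solve`, and `try_api` are assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Shared limits reported in the paper's experiment setup."""
    max_api_calls: int = 10       # DFSDT solver: at most 10 API calls
    max_tokens: int = 200_000     # AnyTool: 200,000-token efficiency limit
    api_calls: int = 0
    tokens: int = 0

    def charge(self, tokens_used: int) -> bool:
        """Record one API call; return False once either limit is exceeded."""
        self.api_calls += 1
        self.tokens += tokens_used
        return self.api_calls <= self.max_api_calls and self.tokens <= self.max_tokens


def solve(candidate_apis, try_api, budget: Budget):
    """Try candidate APIs in order under the shared budget.

    `try_api` stands in for issuing a real API call; it returns a pair
    (answer_or_None, tokens_consumed). The search stops as soon as an
    answer is found or either budget limit runs out.
    """
    for api in candidate_apis:
        answer, tokens_used = try_api(api)
        if not budget.charge(tokens_used):
            return None  # budget exhausted; caller may expand the API pool
        if answer is not None:
            return answer
    return None
```

For example, a solver that finds an answer on its third candidate stops after three charged calls, while one whose calls each consume 30,000 tokens is cut off after seven calls, when the cumulative token count first exceeds 200,000.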