AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls
Authors: Yu Du, Fangyun Wei, Hongyang Zhang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across various datasets demonstrate the superiority of our AnyTool over strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization. For instance, AnyTool outperforms ToolLLM by +35.4% in terms of average pass rate on ToolBench. Code is available at https://github.com/dyabel/AnyTool. |
| Researcher Affiliation | Collaboration | Yu Du* (Tsinghua University), Fangyun Wei* (Microsoft Research Asia), Hongyang Zhang (University of Waterloo). |
| Pseudocode | No | The paper describes algorithms such as DFSDT and CoT in prose but does not present them in formal pseudocode blocks or explicitly labeled 'Algorithm' sections (see the hedged DFSDT sketch after this table). |
| Open Source Code | Yes | Code is available at https://github.com/dyabel/AnyTool. |
| Open Datasets | Yes | We conduct experiments on two benchmarks: 1) ToolBench (Qin et al., 2023b); and 2) our own benchmark, termed AnyToolBench. ... To ensure that all queries in the benchmark, namely ToolBench (Qin et al., 2023b), are solvable using certain APIs from the API pool, we conduct a manual review of all queries. ... The process of creating AnyToolBench is detailed in Section A.8 of the appendix. |
| Dataset Splits | No | The paper does not explicitly provide training/validation dataset splits (e.g., percentages, sample counts, or specific predefined split citations) needed to reproduce data partitioning for their experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory specifications, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions software such as GPT-4, GPT-3.5, and ChatGLM but does not specify their version numbers or other ancillary software dependencies with versions. |
| Experiment Setup | Yes | For the solver implementing DFSDT, we set the maximum number of API calls to 10. Additionally, for our AnyTool, we establish a limit of 200,000 tokens for efficiency (a sketch of how these budgets could be enforced follows the table). |
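
The Pseudocode row notes that the paper describes DFSDT only in prose. For illustration, below is a minimal, hedged sketch of a DFSDT-style solver: a depth-first search over LLM-proposed API calls, capped at the 10-call budget quoted in the Experiment Setup row. The helpers `llm_propose_actions`, `call_api`, and `is_solved` are hypothetical placeholders, not functions from the AnyTool release.

```python
# Minimal sketch of a DFSDT-style solver: depth-first search over
# LLM-proposed API calls, bounded by the 10-call budget reported in the
# paper. The three helpers below are illustrative placeholders only.

MAX_API_CALLS = 10


def llm_propose_actions(query, history):
    """Placeholder: candidate (api, args) pairs suggested by the LLM."""
    raise NotImplementedError


def call_api(api, args):
    """Placeholder: execute one API call and return its result."""
    raise NotImplementedError


def is_solved(query, history):
    """Placeholder: check whether the trajectory answers the query."""
    raise NotImplementedError


def dfsdt_solve(query, history=None, calls_left=MAX_API_CALLS):
    """Return (trajectory, calls_left); trajectory is None on failure."""
    history = history or []
    for api, args in llm_propose_actions(query, history):
        if calls_left <= 0:
            break  # API-call budget exhausted
        calls_left -= 1
        result = call_api(api, args)
        step = history + [(api, args, result)]
        if is_solved(query, step):
            return step, calls_left
        # Depth-first: explore this branch before trying the next sibling.
        solution, calls_left = dfsdt_solve(query, step, calls_left)
        if solution is not None:
            return solution, calls_left
    return None, calls_left  # backtrack to the parent node
```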
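
The Experiment Setup row also quotes a 200,000-token limit for the full AnyTool run. The sketch below shows one way such a budget could be tracked; `count_tokens` is an assumed helper (a crude whitespace proxy here, standing in for a real tokenizer) and the class is illustrative, not part of the paper's implementation.

```python
MAX_TOKENS = 200_000  # run-level token limit quoted in the paper


def count_tokens(text: str) -> int:
    """Assumed tokenizer helper; replace with a real tokenizer."""
    return len(text.split())  # whitespace proxy, for illustration only


class TokenBudget:
    """Tracks tokens consumed across all LLM calls in one run."""

    def __init__(self, limit: int = MAX_TOKENS):
        self.remaining = limit

    def charge(self, prompt: str, completion: str) -> None:
        used = count_tokens(prompt) + count_tokens(completion)
        if used > self.remaining:
            raise RuntimeError("token budget exhausted; stop the run")
        self.remaining -= used
```

A run would call `charge` after every LLM request and abort once the limit is reached, mirroring the efficiency cap described in the setup.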