Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Tool-Planner: Task Planning with Clusters across Multiple Tools
Authors: Yanming Liu, Xinyue Peng, Jiannan Cao, S, Yuwei Zhang, Xuhong Zhang, Sheng Cheng, Xun Wang, Jianwei Yin, Tianyu Du
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on two different datasets: ToolBench (Qin et al., 2024), which uses APIs selected from RapidAPI Hub (Rapid, 2023), and APIBench (Patil et al., 2023), which fetches APIs from various open-source models. Compared to various prior tool-learning search methods, Tool-Planner achieves a +8.8% increase in pass rate and +9.1% increase in win rate on ToolBench, as well as a +6.6% increase in pass rate and +14.5% increase in win rate on APIBench when tested with GPT-4. It also demonstrates outstanding performance in terms of re-planning frequency and computational speed. Extensive experimental results highlight the advancements of Tool-Planner. |
| Researcher Affiliation | Academia | 1Zhejiang University, 2Southeast University, 3Massachusetts Institute of Technology. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Tool-Planner Exploration Search |
| Open Source Code | Yes | Our code is public at https://github.com/OceannTwT/Tool-Planner. |
| Open Datasets | Yes | We utilize ToolBench (Qin et al., 2024) and APIBench (Patil et al., 2023) as our experimental datasets. ToolBench comprises 16,464 APIs, which are categorized into different tools and categories. In ToolBench, there are three different datasets for prompt generation, namely G1, G2, and G3, which represent single-tool instructions, intra-category multi-tool instructions, and intra-collection multi-tool instructions, respectively. In APIBench, datasets are constructed by selecting corresponding 1,645 APIs and their descriptions from three platforms as tools, and a series of questions are used to evaluate their performance. More details are described in Appendix B. We use the API interfaces selected by ToolBench and APIBench along with their corresponding documentation and descriptions to extract and generate functional explanations of the APIs. Subsequently, we generate tool embeddings based on their functionalities $\{m_{T_h}\}_{h=1}^{k}$. |
| Dataset Splits | Yes | The ToolBench test set is categorized into six distinct groups: G1-instruction, G1-tool, G1-category, G2-instruction, G2-category, and G3-instruction. Groups labeled with instruction include test instructions that utilize tools from the training set, thereby representing in-domain test data. In contrast, groups labeled with tool or category feature test instructions that do not use tools from the training set, representing out-of-domain test data. Each group consists of 100 user instructions, totaling 400 instructions for the in-domain test set and 200 instructions for the out-of-domain test set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments. It mentions using various large language models (GPT-3.5, GPT-4, Claude-3, Llama-2-13B) but not the underlying hardware. |
| Software Dependencies | No | The paper mentions several models and algorithms used (e.g., SimCSE, k-means++, RoBERTa-base, Contriever, text-embedding-ada-002) and APIs (OpenAI API, Claude API) but does not provide specific version numbers for these software components or any programming languages/libraries. |
| Experiment Setup | Yes | We set k as 1800 in ToolBench and k as 65 in APIBench for experiments. We utilize the k-means++ (Arthur & Vassilvitskii, 2007) algorithm, which can quickly converge by pre-setting initial cluster nodes. Additionally, both the OpenAI API and Claude API interfaces we use have an initial temperature setting of 0.3 for inference and planning. |
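The experiment setup above clusters tool embeddings with k-means++ seeding, which spreads the initial cluster centers apart before running standard Lloyd iterations. Below is a minimal pure-Python sketch of that technique for illustration only; it is not the authors' implementation, and the toy 2-D "embedding" vectors stand in for the real tool embeddings (the paper uses k = 1800 on ToolBench and k = 65 on APIBench).

```python
import random


def kmeanspp_init(points, k, rng):
    """k-means++ seeding (Arthur & Vassilvitskii, 2007): pick each new
    center with probability proportional to its squared distance from
    the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance of every point to its nearest existing center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(pt, c)) for c in centers)
              for pt in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(pt)
                break
    return centers


def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm with k-means++ initialization.

    Returns (centers, clusters), where clusters[j] holds the points
    currently assigned to centers[j].
    """
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for pt in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(pt, centers[i])))
            clusters[j].append(pt)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster happens to be empty.
        centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                   else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, clusters
```

In the paper's setting the points would be the tool embeddings derived from API documentation, and each resulting cluster approximates a "toolkit" of functionally similar APIs.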