Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Tool Unlearning for Tool-Augmented LLMs
Authors: Jiali Cheng, Hadi Amiri
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple tool learning datasets and tool-augmented LLMs show that TOOLDELETE effectively unlearns both randomly selected and class-specific tools, while preserving knowledge on remaining tools and maintaining performance on general tasks. |
| Researcher Affiliation | Academia | ¹University of Massachusetts Lowell, USA. Correspondence to: Jiali Cheng <jiali EMAIL>, Hadi Amiri <hadi EMAIL>. |
| Pseudocode | No | The paper describes the TOOLDELETE framework with mathematical formulations and detailed textual explanations of its properties and training details. However, it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper refers to public checkpoints of tool-augmented LLMs on Hugging Face (TangQiaoYu/ToolAlpaca-7B, ToolBench/ToolLLaMA-2-7b-v2, gorilla-llm/gorilla-openfunctions-v0) as starting points for unlearning. However, it does not provide any explicit statement or link for the source code of the proposed TOOLDELETE methodology itself. |
| Open Datasets | Yes | We experiment with the following datasets and their corresponding LLMs: ToolAlpaca (Tang et al., 2023) is an agent-generated tool learning dataset consisting of 495 tools and 3975 training examples. [...] ToolBench (Qin et al., 2024) consists of more than 16k real world APIs from 49 categories [...] API-Bench (Patil et al., 2023) focus on APIs that load machine learning models. |
| Dataset Splits | Yes | Then we conduct unlearning experiments with 2–20% tools randomly selected as T_f. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions specific models like 'Vicuna-v1.3', 'LLaMA-2 7B', and 'LLaMA 7B', and references a 'Python transformers package' in an example. However, it does not list specific software dependencies with their version numbers required to replicate the experimental setup. |
| Experiment Setup | Yes | We use a learning rate of 10⁻⁵ across all experiments. |
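
The Dataset Splits row describes randomly selecting a fraction of tools as the forget set T_f. The following is a minimal sketch of such a split, not the authors' code: the function name `split_tools`, the fixed seed, the rounding rule, and the placeholder tool names are all assumptions made for illustration. Only the 495-tool count (ToolAlpaca) and the 2% to 20% range come from the table above.

```python
import random


def split_tools(tool_names, forget_fraction, seed=0):
    """Randomly partition tools into a forget set (T_f) and a retain set.

    Hypothetical helper: the seed and rounding rule are illustrative
    assumptions, not details reported in the paper.
    """
    if not 0.0 < forget_fraction < 1.0:
        raise ValueError("forget_fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)      # local RNG so the split is repeatable
    shuffled = sorted(tool_names)  # sort first so input order does not matter
    rng.shuffle(shuffled)
    n_forget = max(1, round(len(shuffled) * forget_fraction))
    return shuffled[:n_forget], shuffled[n_forget:]


# Placeholder names standing in for the 495 ToolAlpaca tools.
tools = [f"tool_{i}" for i in range(495)]
forget_set, retain_set = split_tools(tools, forget_fraction=0.02)
```

Fixing the seed and recording it alongside the chosen fraction would make a split like this reproducible across runs.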