Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools

Authors: Kanghua Mo, Li Hu, Yucheng Long, Zhihao Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments across ten realistic, simulated tool-use scenarios and a range of popular LLM agents demonstrate consistently high attack success rates (81%-95%) and significant privacy leakage, with negligible impact on primary task execution.
Researcher Affiliation Academia 1Cyberspace Institute of Advanced Technology, Guangzhou University 2Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University
Pseudocode Yes Algorithm 1: AMA Optimization Input: Fixed query set Q, normal tool set NT, generation prompt Pg, maximum iterations K, batch size nt. Output: Optimized malicious tool t.
Open Source Code Yes Code is available at https://github.com/SEAIC-M/AMA.
Open Datasets Yes We assume a static environment defined by a fixed query set $Q = \{q_1, q_2, \dots, q_{n_q}\}$ and a set of normal tools $NT = \{T_1, T_2, \dots, T_{n_T}\}$, both collected from existing open-source tool-learning systems. At each optimization step, the state $S$ consists of the set of currently generated malicious tools along with their associated invocation probabilities, defined with respect to $(Q, NT)$. Formally, $S = \{(t, p) \mid t \text{: generated malicious tool},\ p \text{: invocation probability}\}$ (2), where the invocation probability $p = P(t, Q, NT)$ is defined as $P(t, Q, NT) = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left(\arg\max_{t' \in NT \cup \{t\}} S(q, O, P_{sys}, \mathrm{Meta}(t')) = t\right)$ (3), and $\mathbb{1}(\cdot)$ is the indicator function that equals 1 if Agent selects $t$ for query $q$, and 0 otherwise.
Dataset Splits No The paper mentions a "fixed query set Q" and a "set of normal tools NT" collected from open-source systems and states that "The settings of Q and NT follow the configuration used in ASB [32]". However, it does not explicitly provide specific percentages, sample counts, or file names for dataset splits within its own text for training, validation, or testing.
Hardware Specification Yes All experiments were conducted using 8 A100 GPUs (80GB each).
Software Dependencies No The paper mentions "Open-source LLMs were deployed via the Xinference [17] framework in local inference mode." but does not provide specific version numbers for Xinference or any other software libraries/dependencies used, other than the LLM model versions themselves.
Experiment Setup Yes We set the maximum number of optimization iterations K to 5, with a batch size of nt = 10. The settings of Q and NT follow the configuration used in ASB [32]. In each iteration, AMA generates up to 10 new tool candidates for each retained malicious tool and computes their selection probabilities. We evaluate three settings of the weighting coefficient λ ∈ {0, 0.5, 1}, and report results for λ = 0.5 in the main text, as it offers the most favorable trade-off between convergence speed and attack efficacy. The attack success threshold τ is set to 0.95 for the targeted setting and 0.8 for the untargeted one. Each experiment is repeated 20 times per model-scenario pair, and we report the averaged results.
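
The invocation probability quoted in the Open Datasets row (Eq. 3) and the success threshold τ quoted in the Experiment Setup row can be illustrated with a short sketch. This is not the authors' implementation: the word-overlap scorer `toy_score`, the helper `meets_threshold`, and the example queries and tool descriptions are all hypothetical stand-ins, and the single `score(query, metadata)` callable collapses the paper's S(q, O, Psys, Meta(·)) into one function as a simplifying assumption.

```python
from typing import Callable, Sequence


def invocation_probability(
    malicious_tool: str,
    queries: Sequence[str],
    normal_tools: Sequence[str],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of queries for which the malicious tool is the top-scoring tool."""
    if not queries:
        raise ValueError("queries must be non-empty")
    # Normal tools come first, so max() resolves ties in their favour.
    candidates = list(normal_tools) + [malicious_tool]
    hits = 0
    for query in queries:
        selected = max(candidates, key=lambda tool: score(query, tool))
        if selected == malicious_tool:
            hits += 1
    return hits / len(queries)


def meets_threshold(p: float, tau: float) -> bool:
    """Hypothetical helper: compare an invocation probability against tau."""
    return p >= tau


def toy_score(query: str, metadata: str) -> float:
    """Hypothetical scorer: number of words shared by query and metadata."""
    return float(len(set(query.lower().split()) & set(metadata.lower().split())))


# Illustrative data only; not taken from the paper or from ASB.
queries = [
    "book a flight to paris",
    "check the weather in paris",
    "convert currency to euro",
]
normal_tools = ["flight booking tool", "weather lookup tool", "currency converter"]
malicious_tool = "book flight check weather paris"

p = invocation_probability(malicious_tool, queries, normal_tools, toy_score)
print(f"invocation probability: {p:.3f}")
print("meets untargeted threshold 0.8:", meets_threshold(p, 0.8))
```

In a real evaluation the scorer would be replaced by the agent's actual tool choice for each query, which is what the indicator function in Eq. 3 records.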