Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents
Authors: Kaijie Zhu, Xianjun Yang, Jindong Wang, Wenbo Guo, William Yang Wang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation on the IPI benchmark Agent Dojo demonstrates that MELON outperforms SOTA defenses in both attack prevention and utility preservation. Moreover, we show that combining MELON with a SOTA prompt augmentation defense (denoted as MELON-Aug) further improves its performance. We also conduct a detailed ablation study to validate our key designs. |
| Researcher Affiliation | Academia | ¹University of California, Santa Barbara; ²William & Mary. Correspondence to: Kaijie Zhu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 MELON Algorithm at Step t |
| Open Source Code | Yes | Code is available at https://github.com/kaijiezhu11/MELON. |
| Open Datasets | Yes | We evaluate MELON on the IPI benchmark Agent Dojo (Debenedetti et al., 2024). Agent Dojo is an evaluation framework for assessing AI agents' robustness against indirect prompt injection attacks. |
| Dataset Splits | No | Agent Dojo includes 16, 21, 20, and 40 user tasks for its respective agents, and each agent also has different attack tasks and injection points. Pairing one user task with one attack task forms an attack case, yielding 629 attack cases in total. The paper describes how attack cases are formed but does not specify training, validation, or test splits for any dataset, nor does it mention any splitting methodology with percentages, sample counts, or random seeds. |
| Hardware Specification | No | We consider three models as the LLM model in each agent: GPT-4o, o3-mini, and Llama-3.3-70B. The paper specifies the LLM models used but does not provide any specific hardware details such as GPU/CPU models, memory, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | Next, we employ OpenAI's text-embedding-v3 model (OpenAI, 2024), which maps these descriptions to dense vector representations. The paper mentions using a specific OpenAI embedding model but does not list any other software dependencies, libraries, or their version numbers that would be necessary to replicate the experiment environment. |
| Experiment Setup | Yes | We set the temperature for each model as 0 to avoid randomness. We set the primary similarity threshold θ = 0.8 to balance detection sensitivity and false positive rate, the ablation study on different similarity thresholds is presented in Section 4.3. The task-neutral prompt Tf is designed to be independent of specific domains or tasks. |