Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs
Authors: Xander Davies, Eric Winsor, Alexandra Souly, Tomek Korbak, Robert Kirk, Christian Schroeder de Witt, Yarin Gal
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples (pointwise detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We demonstrate a class of pointwise-undetectable attacks that repurpose semantic or syntactic variations in benign model outputs to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. |
| Researcher Affiliation | Collaboration | Xander Davies¹˒², Eric Winsor¹, Alexandra Souly¹, Tomek Korbak¹, Robert Kirk¹, Christian Schroeder de Witt², Yarin Gal¹˒². ¹UK AI Security Institute; ²University of Oxford |
| Pseudocode | No | The paper does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block. Figure 1 and Figure 5 illustrate conceptual processes with numbered steps and flow but are not presented as formal pseudocode or algorithm blocks. There are no explicit algorithms described in a code-like format. |
| Open Source Code | No | We have not yet released code or data. |
| Open Datasets | Yes | We iterated on those prompts to maximise accuracy on a mixture of several datasets: XSTest [Röttger et al., 2024], OR-Bench [Cui et al., 2024] and validation splits of our the set of prompts for IED-MCQ and Copyright-MCQ paired with direct answers (e.g. B, as positive examples) and refusals (as negative examples), see details in Appendix H.1. |
| Dataset Splits | Yes | We sample 211 training samples, 30 validation samples, and 61 test samples for experiments. Validation samples are used only to track progress during fine-tuning. |
| Hardware Specification | No | The results are entirely based on language model API use, so we do not provide compute resource information. |
| Software Dependencies | No | The paper mentions specific AI models and APIs (e.g., OpenAI fine-tuning API, GPT-4o, Sonnet 3.5) but does not provide specific version numbers for any ancillary software dependencies or libraries used for their own implementation or analysis. |
| Experiment Setup | Yes | We train for 3 epochs with default fine-tuning hyperparameters. |
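
The split sizes quoted in the Dataset Splits row (211 training, 30 validation, 61 test) can be illustrated with a small sketch. The paper excerpt does not say how the samples were drawn, and no code has been released, so the seeded random sampling, the function name `split_samples`, and the placeholder question pool below are all assumptions made for illustration only.

```python
import random


def split_samples(samples, n_train=211, n_val=30, n_test=61, seed=0):
    """Draw disjoint train/validation/test subsets of the quoted sizes.

    Hypothetical helper: the seed and the sampling procedure are assumptions,
    not taken from the paper.
    """
    needed = n_train + n_val + n_test
    if len(samples) < needed:
        raise ValueError(f"need at least {needed} samples, got {len(samples)}")

    rng = random.Random(seed)
    # Sample without replacement so the three subsets cannot overlap.
    chosen = rng.sample(samples, needed)

    train = chosen[:n_train]
    val = chosen[n_train:n_train + n_val]
    test = chosen[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Placeholder pool standing in for the multiple-choice questions.
    pool = [f"question-{i}" for i in range(400)]
    train, val, test = split_samples(pool)
    print(len(train), len(val), len(test))  # 211 30 61
```

Fixing the seed makes the split repeatable, which is one way a reader could record the exact partition when the original split files are unavailable.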