Language Model Detectors Are Easily Optimized Against
Authors: Charlotte Nicks, Eric Mitchell, Rafael Rafailov, Archit Sharma, Christopher D. Manning, Chelsea Finn, Stefano Ermon
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we demonstrate a data-efficient attack that fine-tunes language models to confuse existing detectors, leveraging recent developments in reinforcement learning of language models. [...] Our experiments find that a simple DPO-based pipeline produces consistent reduction in detectability against various detectors. |
| Researcher Affiliation | Academia | Charlotte Nicks, Eric Mitchell, Rafael Rafailov, Archit Sharma, Christopher D. Manning, Chelsea Finn, Stefano Ermon Stanford University cnicks13@stanford.edu |
| Pseudocode | No | The paper describes the DPO algorithm but does not include a pseudocode block or an explicitly labeled algorithm (the standard DPO objective is restated after the table for reference). |
| Open Source Code | Yes | Models, datasets, and selected experiment code will be released at https://github.com/charlottttee/llm-detector-evasion. |
| Open Datasets | Yes | In Sections 4.1-4.3, we generate texts for detection that continue short 8-token prefixes of Open Web Text documents (Gokaslan & Cohen, 2019). For the experiments with chat-tuned models in Section 4.4's essay-generating case study, we use prompts from the Alpaca instruction dataset (Taori et al., 2023) for generic evasion tuning... |
| Dataset Splits | No | To obtain training and testing data, we load the first 110k texts from the openwebtext dataset. Of these, we randomly select 10k evaluation texts. We then save both evaluation and training prompts, which are the first 8 GPT-2 tokens of these texts. All evaluations are done using completions to the eval prompts (including the human completion, which is truncated to approximately match the token count of the data it is being compared to in any given metric, and model-generated completions). (A sketch of this split procedure follows the table.) |
| Hardware Specification | No | For compute, we used widely available consumer hardware and only a few hours of training time. |
| Software Dependencies | No | The paper does not specify particular software dependencies with version numbers for reproducibility. |
| Experiment Setup | Yes | We fine-tune three Llama-2 7B models on preferences computed from a variety of open source (Table 1) and commercial (Table 2) detectors; each of the three models uses a different β parameter for DPO, which corresponds to the strength of the KL regularization. We use β ∈ {0.05, 0.5, 5}. [...] We train each model for up to 30k preference pairs (selecting the lowest AUROC checkpoint at 10k increments). (Sketches of the preference construction and checkpoint selection follow the table.) |
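
For reference, the DPO objective mentioned in the Pseudocode and Experiment Setup rows is the standard direct preference optimization loss of Rafailov et al. (2023); the paper does not restate it as pseudocode, so the form below is the generic objective rather than anything specific to this work:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],
$$

where $y_w$ is the preferred (here, less detectable) completion, $y_l$ the rejected one, $\pi_{\mathrm{ref}}$ the frozen reference model, and $\beta$ the KL-regularization strength swept over $\{0.05, 0.5, 5\}$.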
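The Dataset Splits row quotes the split procedure but no split script is reproduced here. Below is a minimal sketch of that procedure, assuming the Hugging Face `openwebtext` dataset and the `transformers` GPT-2 tokenizer; the dataset identifier, streaming access, and random seed are assumptions, not the authors' released code.

```python
# Hypothetical reconstruction of the split quoted above, not the authors' released code.
import random

from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
ds = load_dataset("openwebtext", split="train", streaming=True)

# First 110k OpenWebText documents.
texts = [row["text"] for _, row in zip(range(110_000), ds)]

# Randomly hold out 10k documents for evaluation (seed is an assumption).
random.seed(0)
eval_idx = set(random.sample(range(len(texts)), 10_000))

def prompt_of(text: str, n_tokens: int = 8) -> str:
    # Prompt = the first 8 GPT-2 tokens of the document, decoded back to text.
    return tokenizer.decode(tokenizer(text)["input_ids"][:n_tokens])

eval_prompts = [prompt_of(t) for i, t in enumerate(texts) if i in eval_idx]
train_prompts = [prompt_of(t) for i, t in enumerate(texts) if i not in eval_idx]
```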
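The Research Type and Experiment Setup rows describe fine-tuning on "preferences computed from" detectors. The released repository is the authoritative source; the sketch below only illustrates the presumable ranking logic, with `detector_score` standing in as a hypothetical callable that returns a detector's machine-generated probability (lower score means harder to detect, hence preferred).

```python
from typing import Callable, Dict

def make_preference_pair(prompt: str,
                         completion_a: str,
                         completion_b: str,
                         detector_score: Callable[[str], float]) -> Dict[str, str]:
    """Rank two sampled completions with a detector; the less detectable one is 'chosen'.

    `detector_score` is a placeholder for any of the open-source or commercial
    detectors listed in Tables 1-2 of the paper.
    """
    score_a = detector_score(completion_a)
    score_b = detector_score(completion_b)
    if score_a <= score_b:
        chosen, rejected = completion_a, completion_b
    else:
        chosen, rejected = completion_b, completion_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Pairs in this `{"prompt", "chosen", "rejected"}` format match what common DPO implementations (e.g., TRL's `DPOTrainer`) expect, so they can be trained on directly with the β values quoted above.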
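Finally, the "lowest AUROC checkpoint" selection in the Experiment Setup row reduces to computing the detector's AUROC for separating human from model completions on the evaluation prompts. A scikit-learn sketch follows (again not the authors' evaluation script), assuming the detector assigns higher scores to machine-generated text.

```python
from sklearn.metrics import roc_auc_score

def detector_auroc(human_scores, model_scores):
    # Label 1 = machine-generated. AUROC near 0.5 means the detector is at chance,
    # so across checkpoints the one with the lowest AUROC is retained.
    labels = [0] * len(human_scores) + [1] * len(model_scores)
    scores = list(human_scores) + list(model_scores)
    return roc_auc_score(labels, scores)
```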