Language Model Detectors Are Easily Optimized Against

Authors: Charlotte Nicks, Eric Mitchell, Rafael Rafailov, Archit Sharma, Christopher D. Manning, Chelsea Finn, Stefano Ermon

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we demonstrate a data-efficient attack that fine-tunes language models to confuse existing detectors, leveraging recent developments in reinforcement learning of language models. [...] Our experiments find that a simple DPO-based pipeline produces consistent reduction in detectability against various detectors.
Researcher Affiliation | Academia | Charlotte Nicks, Eric Mitchell, Rafael Rafailov, Archit Sharma, Christopher D. Manning, Chelsea Finn, Stefano Ermon, Stanford University, cnicks13@stanford.edu
Pseudocode | No | The paper describes the DPO algorithm but does not include a pseudocode block or an explicitly labeled algorithm.
Open Source Code | Yes | Models, datasets, and selected experiment code will be released at https://github.com/charlottttee/llm-detector-evasion.
Open Datasets | Yes | In Sections 4.1-4.3, we generate texts for detection that continue short 8-token prefixes of OpenWebText documents (Gokaslan & Cohen, 2019). For the experiments with chat-tuned models in Section 4.4's essay-generating case study, we use prompts from the Alpaca instruction dataset (Taori et al., 2023) for generic evasion tuning...
Dataset Splits | No | To obtain training and testing data, we load the first 110k texts from the openwebtext dataset. Of these, we randomly select 10k evaluation texts. We then save both evaluation and training prompts, which are the first 8 GPT-2 tokens of these texts. All evaluations are done using completions to the eval prompts (including the human completion, which is truncated to approximately match the token count of the data it is being compared to in any given metric, and model-generated completions).
Hardware Specification | No | For compute, we used widely available consumer hardware and only a few hours of training time.
Software Dependencies | No | The paper does not specify particular software dependencies with version numbers for reproducibility.
Experiment Setup | Yes | We fine-tune three Llama-2 7B models on preferences computed from a variety of open source (Table 1) and commercial (Table 2) detectors; each of the three models uses a different β parameter for DPO, which corresponds to the strength of the KL regularization. We use β ∈ {0.05, 0.5, 5}. [...] We train each model for up to 30k preference pairs (selecting the lowest AUROC checkpoint at 10k increments).
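
The prompt and split construction described in the Dataset Splits row above could be sketched roughly as follows, assuming the Hugging Face datasets and transformers libraries; the random seed and variable names are illustrative and not taken from the authors' released code.

```python
# Hedged sketch of the prompt/split construction from the Dataset Splits row;
# assumes Hugging Face `datasets` and `transformers`. Seed is an assumption.
import random

from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Load the first 110k OpenWebText documents and hold out 10k for evaluation.
stream = load_dataset("openwebtext", split="train", streaming=True)
texts = [row["text"] for _, row in zip(range(110_000), stream)]
random.seed(0)  # illustrative; the paper does not report a seed
random.shuffle(texts)
eval_texts, train_texts = texts[:10_000], texts[10_000:]


def to_prompt(text: str, n_tokens: int = 8) -> str:
    """Keep only the first 8 GPT-2 tokens of a document as the generation prompt."""
    ids = tokenizer(text, truncation=True, max_length=n_tokens)["input_ids"]
    return tokenizer.decode(ids)


train_prompts = [to_prompt(t) for t in train_texts]
eval_prompts = [to_prompt(t) for t in eval_texts]
```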
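
The evasion-tuning pipeline summarized in the Research Type and Experiment Setup rows (rank two sampled completions by a detector's machine-generated score, then optimize the DPO objective with KL strength β) might look roughly like the sketch below. The detector_score callable, the helper names, and all training-loop details are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the detector-evasion DPO objective; not the authors' code.
# Assumes a Hugging Face causal LM policy, a frozen reference copy of it, and
# a detector_score(text) callable returning P(machine-generated).
import torch
import torch.nn.functional as F


def sequence_logprob(model, tokenizer, prompt: str, completion: str) -> torch.Tensor:
    """Sum of token log-probs the model assigns to `completion` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]                        # predictions for tokens 1..N
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum()              # completion tokens only


def make_preference_pair(prompt, completion_a, completion_b, detector_score):
    """The completion the detector scores as less machine-like is 'chosen'."""
    if detector_score(prompt + completion_a) <= detector_score(prompt + completion_b):
        return completion_a, completion_b
    return completion_b, completion_a


def dpo_loss(policy, ref, tokenizer, prompt, chosen, rejected, beta=0.5):
    """Standard DPO loss; beta sets the strength of the implicit KL regularization."""
    pi_w = sequence_logprob(policy, tokenizer, prompt, chosen)
    pi_l = sequence_logprob(policy, tokenizer, prompt, rejected)
    with torch.no_grad():                                     # frozen reference model
        ref_w = sequence_logprob(ref, tokenizer, prompt, chosen)
        ref_l = sequence_logprob(ref, tokenizer, prompt, rejected)
    margin = (pi_w - ref_w) - (pi_l - ref_l)
    return -F.logsigmoid(beta * margin)
```

In the setup quoted above, this loss would be minimized over preference pairs built from the OpenWebText prompts, with separate training runs at each β ∈ {0.05, 0.5, 5}.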