Language Model Detectors Are Easily Optimized Against
Authors: Charlotte Nicks, Eric Mitchell, Rafael Rafailov, Archit Sharma, Christopher D. Manning, Chelsea Finn, Stefano Ermon
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we demonstrate a data-efficient attack that fine-tunes language models to confuse existing detectors, leveraging recent developments in reinforcement learning of language models. [...] Our experiments find that a simple DPO-based pipeline produces consistent reduction in detectability against various detectors. |
| Researcher Affiliation | Academia | Charlotte Nicks, Eric Mitchell, Rafael Rafailov, Archit Sharma, Christopher D. Manning, Chelsea Finn, Stefano Ermon Stanford University cnicks13@stanford.edu |
| Pseudocode | No | The paper describes the DPO algorithm but does not include a pseudocode block or an explicitly labeled algorithm (the standard DPO objective is restated after the table for reference). |
| Open Source Code | Yes | Models, datasets, and selected experiment code will be released at https://github.com/charlottttee/llm-detector-evasion. |
| Open Datasets | Yes | In Sections 4.1-4.3, we generate texts for detection that continue short 8-token prefixes of Open Web Text documents (Gokaslan & Cohen, 2019). For the experiments with chat-tuned models in Section 4.4's essay-generating case study, we use prompts from the Alpaca instruction dataset (Taori et al., 2023) for generic evasion tuning... |
| Dataset Splits | No | To obtain training and testing data, we load the first 110k texts from the openwebtext dataset. Of these, we randomly select 10k evaluation texts. We then save both evaluation and training prompts, which are the first 8 GPT-2 tokens of these texts. All evaluations are done using completions to the eval prompts (including the human completion, which is truncated to approximately match the token count of the data it is being compared to in any given metric, and model-generated completions). (A sketch of this split procedure follows the table.) |
| Hardware Specification | No | For compute, we used widely available consumer hardware and only a few hours of training time. |
| Software Dependencies | No | The paper does not specify particular software dependencies with version numbers for reproducibility. |
| Experiment Setup | Yes | We fine-tune three Llama-2 7B models on preferences computed from a variety of open source (Table 1) and commercial (Table 2) detectors; each of the three models uses a different β parameter for DPO, which corresponds to the strength of the KL regularization. We use β ∈ {0.05, 0.5, 5}. [...] We train each model for up to 30k preference pairs (selecting the lowest AUROC checkpoint at 10k increments). (Sketches of the preference construction and checkpoint selection follow the table.) |
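
For reference, the DPO objective mentioned in the Pseudocode and Experiment Setup rows is the standard direct preference optimization loss of Rafailov et al. (2023); the paper does not restate it as pseudocode, so the form below is the generic objective rather than anything specific to this work:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],
$$

where $y_w$ is the preferred (here, less detectable) completion, $y_l$ the rejected one, $\pi_{\mathrm{ref}}$ the frozen reference model, and $\beta$ the KL-regularization strength swept over $\{0.05, 0.5, 5\}$.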
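The Dataset Splits row quotes the split procedure but no split script is reproduced here. Below is a minimal sketch of that procedure, assuming the Hugging Face `openwebtext` dataset and the `transformers` GPT-2 tokenizer; the dataset identifier, streaming access, and random seed are assumptions, not the authors' released code.

```python
# Hypothetical reconstruction of the split quoted above, not the authors' released code.
import random

from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
ds = load_dataset("openwebtext", split="train", streaming=True)

# First 110k OpenWebText documents.
texts = [row["text"] for _, row in zip(range(110_000), ds)]

# Randomly hold out 10k documents for evaluation (seed is an assumption).
random.seed(0)
eval_idx = set(random.sample(range(len(texts)), 10_000))

def prompt_of(text: str, n_tokens: int = 8) -> str:
    # Prompt = the first 8 GPT-2 tokens of the document, decoded back to text.
    return tokenizer.decode(tokenizer(text)["input_ids"][:n_tokens])

eval_prompts = [prompt_of(t) for i, t in enumerate(texts) if i in eval_idx]
train_prompts = [prompt_of(t) for i, t in enumerate(texts) if i not in eval_idx]
```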
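The Research Type and Experiment Setup rows describe fine-tuning on "preferences computed from" detectors. The released repository is the authoritative source; the sketch below only illustrates the presumable ranking logic, with `detector_score` standing in as a hypothetical callable that returns a detector's machine-generated probability (lower score means harder to detect, hence preferred).

```python
from typing import Callable, Dict

def make_preference_pair(prompt: str,
                         completion_a: str,
                         completion_b: str,
                         detector_score: Callable[[str], float]) -> Dict[str, str]:
    """Rank two sampled completions with a detector; the less detectable one is 'chosen'.

    `detector_score` is a placeholder for any of the open-source or commercial
    detectors listed in Tables 1-2 of the paper.
    """
    score_a = detector_score(completion_a)
    score_b = detector_score(completion_b)
    if score_a <= score_b:
        chosen, rejected = completion_a, completion_b
    else:
        chosen, rejected = completion_b, completion_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Pairs in this `{"prompt", "chosen", "rejected"}` format match what common DPO implementations (e.g., TRL's `DPOTrainer`) expect, so they can be trained on directly with the β values quoted above.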
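Finally, the "lowest AUROC checkpoint" selection in the Experiment Setup row reduces to computing the detector's AUROC for separating human from model completions on the evaluation prompts. A scikit-learn sketch follows (again not the authors' evaluation script), assuming the detector assigns higher scores to machine-generated text.

```python
from sklearn.metrics import roc_auc_score

def detector_auroc(human_scores, model_scores):
    # Label 1 = machine-generated. AUROC near 0.5 means the detector is at chance,
    # so across checkpoints the one with the lowest AUROC is retained.
    labels = [0] * len(human_scores) + [1] * len(model_scores)
    scores = list(human_scores) + list(model_scores)
    return roc_auc_score(labels, scores)
```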