DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature
Authors: Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, Chelsea Finn
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments find that DetectGPT is more accurate than existing zero-shot methods for detecting machine-generated text, improving over the strongest zero-shot baseline by over 0.1 AUROC for multiple source models when detecting machine-generated news articles. We conduct experiments to better understand multiple facets of machine-generated text detection; we study the effectiveness of DetectGPT for zero-shot machine-generated text detection compared to prior zero-shot approaches, the impact of distribution shift on zero-shot and supervised detectors, and detection accuracy for the largest publicly-available models. |
| Researcher Affiliation | Academia | Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, Chelsea Finn (Stanford University). Correspondence to: Eric Mitchell <eric.mitchell@cs.stanford.edu>. |
| Pseudocode | Yes | Algorithm 1 DetectGPT model-generated text detection (a hedged sketch of this detection criterion appears below the table). |
| Open Source Code | Yes | See ericmitchell.ai/detectgpt for code, data, and other project information. |
| Open Datasets | Yes | Datasets & metrics: Our experiments use six datasets that cover a variety of everyday domains and LLM use-cases. We use news articles from the XSum dataset (Narayan et al., 2018) to represent fake news detection, Wikipedia paragraphs from SQuAD contexts (Rajpurkar et al., 2016) to represent machine-written academic essays, and prompted stories from the Reddit WritingPrompts dataset (Fan et al., 2018) to represent detecting machine-generated creative writing submissions; for model samples, we use the output of four different LLMs when prompted with the first 30 tokens of each article in XSum. To evaluate robustness to distribution shift, we also use the English and German splits of WMT16 (Bojar et al., 2016) as well as long-form answers written by human experts in the PubMedQA dataset (Jin et al., 2019). |
| Dataset Splits | No | The paper evaluates DetectGPT on various datasets using 150-500 examples for evaluation, but as a zero-shot method, it does not specify explicit training or validation splits for its own operation; only test/evaluation sets are used. |
| Hardware Specification | No | The Stanford Center for Research on Foundation Models (CRFM) provided part of the compute resources used for the experiments in this work. This statement is too general and does not provide specific hardware details such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions various models used (e.g., T5-3B, GPT-2, RoBERTa), but it does not specify software dependencies like programming language versions (e.g., Python 3.8) or library versions (e.g., PyTorch 1.9) required for reproducibility. |
| Experiment Setup | Yes | Hyperparameters. The key hyperparameters of DetectGPT are the fraction of words masked for perturbation, the length of the masked spans, the model used for mask filling, and the sampling hyperparameters for the mask-filling model. Using BERT (Devlin et al., 2019) masked language modeling as inspiration, we use 15% as the mask rate. We performed a small sweep over masked span lengths of {2, 5, 10} on a held-out set of XSum data, finding 2 to perform best. We use these settings for all experiments, without re-tuning. We use T5-3B for almost all experiments, except for GPT-NeoX and GPT-3 experiments, where compute resources allowed for the larger T5-11B model; we also use mT5-3B instead of T5-3B for the WMT multilingual experiment. We do not tune the hyperparameters for the mask-filling model, sampling directly with temperature 1. (A sketch of this perturbation setup appears below the table.) |
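
The hyperparameters reported above map onto a small perturbation routine: mask roughly 15% of the words in spans of two and let a T5 model sample replacements at temperature 1. The Python sketch below shows one way to do this with Hugging Face `transformers`; the function names (`mask_spans`, `apply_perturbation`), the simplified sentinel parsing, and the `max_new_tokens` budget are illustrative assumptions, not the authors' released implementation (see ericmitchell.ai/detectgpt for that).

```python
import random
import re

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative sketch of the span-masking perturbation described above:
# mask ~15% of words in spans of length 2 and let T5 fill the blanks,
# sampling at temperature 1. Not the authors' released code.

tokenizer = AutoTokenizer.from_pretrained("t5-3b")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b")


def mask_spans(text, mask_rate=0.15, span_len=2):
    """Replace ~mask_rate of the words with T5 sentinel tokens, in spans of span_len."""
    words = text.split()
    n_spans = max(1, int(mask_rate * len(words) / span_len))
    starts = set(random.sample(range(max(1, len(words) - span_len)), n_spans))
    out, sentinel, i = [], 0, 0
    while i < len(words):
        if i in starts and sentinel < 100:  # T5 defines 100 sentinel tokens
            out.append(f"<extra_id_{sentinel}>")
            sentinel += 1
            i += span_len
        else:
            out.append(words[i])
            i += 1
    return " ".join(out), sentinel


def apply_perturbation(text):
    """Return one perturbed copy of `text` with masked spans rewritten by T5."""
    masked, n_masks = mask_spans(text)
    input_ids = tokenizer(masked, return_tensors="pt").input_ids
    with torch.no_grad():
        generated = model.generate(
            input_ids, do_sample=True, temperature=1.0, max_new_tokens=4 * n_masks
        )
    decoded = tokenizer.decode(generated[0], skip_special_tokens=False)
    # Simplified parsing of "<extra_id_0> fill0 <extra_id_1> fill1 ...".
    pieces = re.split(r"<extra_id_\d+>", decoded)[1:]
    for k in range(n_masks):
        fill = pieces[k] if k < len(pieces) else ""
        fill = fill.replace("<pad>", "").replace("</s>", "").strip()
        masked = masked.replace(f"<extra_id_{k}>", fill, 1)
    return masked
```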
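
Algorithm 1 then scores a candidate passage by its perturbation discrepancy: the passage's log-likelihood under the source model minus the mean log-likelihood of its perturbations, normalized by their standard deviation, with larger values indicating machine-generated text. The following is a hedged sketch of that criterion, assuming a Hugging Face causal LM (GPT-2 here as a stand-in source model) and the `apply_perturbation` helper from the previous sketch; `detectgpt_score`, `is_machine_generated`, and the threshold `epsilon` are hypothetical names and values, not the paper's operating point.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative sketch of the perturbation-discrepancy criterion (Algorithm 1).
# Assumes `apply_perturbation` from the previous sketch; names are assumptions.

src_tokenizer = AutoTokenizer.from_pretrained("gpt2")
src_model = AutoModelForCausalLM.from_pretrained("gpt2")


def avg_log_likelihood(text):
    """Mean per-token log-likelihood of `text` under the source model."""
    ids = src_tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = src_model(ids, labels=ids).loss  # mean negative log-likelihood
    return -loss.item()


def detectgpt_score(candidate, k=100):
    """Normalized perturbation discrepancy: higher means more likely machine-generated."""
    lp_x = avg_log_likelihood(candidate)
    lp_perturbed = [avg_log_likelihood(apply_perturbation(candidate)) for _ in range(k)]
    mu, sigma = np.mean(lp_perturbed), np.std(lp_perturbed) + 1e-8
    return (lp_x - mu) / sigma


def is_machine_generated(candidate, epsilon=1.0, k=100):
    """Threshold the discrepancy; epsilon is a hypothetical operating point."""
    return detectgpt_score(candidate, k=k) > epsilon
```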