Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text

Authors: Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We comprehensively evaluate Binoculars on a number of text sources and in varied situations. Over a wide range of document types, Binoculars detects over 90% of generated samples from ChatGPT (and other LLMs) at a false positive rate of 0.01%, despite not being trained on any ChatGPT data.
Researcher Affiliation | Academia | 1 University of Maryland, 2 Carnegie Mellon University, 3 New York University, 4 ELLIS Institute Tübingen, MPI Intelligent Systems. Correspondence to: Abhimanyu Hans <ahans1@umd.edu>, Avi Schwarzschild <schwarzschild@cmu.edu>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://github.com/ahans30/Binoculars.
Open Datasets | Yes | The most recent baseline to which we compare is Ghostbuster. Verma et al. (2023), who propose this method, also introduce three datasets that we include in our study: Writing Prompts, News, and Student Essay. [...] Drawing samples of human-written text from CCNews (Hamborg et al., 2017), PubMed (Sen et al., 2008), and CNN (Hermann et al., 2015), we generate corresponding, machine-generated completions using LLaMA-2-7B and Falcon-7B [...] Further, we use the Orca dataset (Lian et al., 2023) [...] (M4) detection datasets (Wang et al., 2023).
Dataset Splits | No | The paper sets its detection threshold using 'the combination of training splits from all of our reference datasets: News, Creative Writing, and Student Essay datasets from Verma et al. (2023)', which implies that training splits are used for threshold tuning, but it does not give the percentages, sample counts, or splitting methodology needed to reproduce the data partitioning across all datasets used.
Hardware Specification | Yes | We finetune pretrained Falcon-7B on alpaca instruction dataset with 5e-5 learning rate and 65K tokens batch size (32 samples * 2048 block size) with cosine annealing ratio of 3% on 4 A5000 GPUs using FSDP distributed training.
Software Dependencies | No | The paper does not provide version numbers for the software dependencies or libraries used in its experiments (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | We finetune pretrained Falcon-7B on alpaca instruction dataset with 5e-5 learning rate and 65K tokens batch size (32 samples * 2048 block size) with cosine annealing ratio of 3% on 4 A5000 GPUs using FSDP distributed training.
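The Research Type row reports detection rates obtained by thresholding the Binoculars score, which the paper defines as the ratio of an observer model's log-perplexity on a string to the cross-perplexity between the observer and a closely related performer model. The sketch below computes that ratio with Hugging Face transformers; the Falcon-7B / Falcon-7B-instruct pairing follows the paper, but the role assignment, dtype, and batching details here are assumptions, not the authors' reference implementation.

```python
# Minimal sketch of the Binoculars score: observer log-perplexity divided by
# observer/performer cross-perplexity. Low scores suggest machine-generated text.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

OBSERVER = "tiiuae/falcon-7b"            # model pair used in the paper; role split is an assumption
PERFORMER = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(OBSERVER)
observer = AutoModelForCausalLM.from_pretrained(OBSERVER, torch_dtype=torch.bfloat16, device_map="auto")
performer = AutoModelForCausalLM.from_pretrained(PERFORMER, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def binoculars_score(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids.to(observer.device)
    obs_logits = observer(ids).logits[:, :-1]   # predictions for tokens 2..L
    per_logits = performer(ids).logits[:, :-1]
    targets = ids[:, 1:]

    # Log-perplexity of the string under the observer.
    log_ppl = F.cross_entropy(obs_logits.transpose(1, 2), targets).item()

    # Cross-perplexity: observer log-probabilities averaged under the
    # performer's next-token distribution at every position.
    x_ppl = -(F.softmax(per_logits, dim=-1) * F.log_softmax(obs_logits, dim=-1)).sum(-1).mean().item()

    return log_ppl / x_ppl
```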
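For the Open Source Code row, a minimal usage sketch of the released repository. The interface shown (a Binoculars class with compute_score and predict methods) is taken from the repository README and may change; treat the import path, default model pair, and return values as assumptions and defer to the repo itself.

```python
# Usage sketch for github.com/ahans30/Binoculars, following its README.
from binoculars import Binoculars

bino = Binoculars()

sample = "Some candidate document to score."
print(bino.compute_score(sample))  # raw Binoculars score (lower = more likely machine-generated)
print(bino.predict(sample))        # thresholded label, e.g. "Most likely AI-generated"
```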
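The Dataset Splits and Research Type rows together describe the evaluation protocol: a single score threshold is tuned on training splits of human-written reference text, and the headline number is the detection (true-positive) rate at a 0.01% false-positive rate. A minimal sketch of that bookkeeping, with synthetic placeholder scores standing in for real Binoculars outputs:

```python
# Tune a threshold to a target false-positive rate on human-written text,
# then read off the detection rate on machine-generated text.
import numpy as np

# Placeholder score distributions; machine text tends to score lower than human text.
human_scores = np.random.default_rng(0).normal(1.00, 0.05, 100_000)
machine_scores = np.random.default_rng(1).normal(0.85, 0.05, 10_000)

target_fpr = 1e-4  # 0.01%

# Lower scores indicate machine-generated text, so the threshold is the
# target_fpr quantile of the human score distribution.
threshold = np.quantile(human_scores, target_fpr)

tpr = (machine_scores < threshold).mean()
print(f"threshold={threshold:.4f}, detection rate at {target_fpr:.2%} FPR = {tpr:.2%}")
```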
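The Hardware Specification and Experiment Setup rows quote the same fine-tuning recipe. The sketch below restates it as Hugging Face TrainingArguments; reading "cosine annealing ratio of 3%" as a warmup fraction, splitting the 32-sample batch as 8 per device across the 4 A5000s, and the specific FSDP flags are assumptions, since the paper does not name its training framework.

```python
# Hypothetical mapping of the quoted fine-tuning setup onto TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="falcon-7b-alpaca",
    learning_rate=5e-5,
    per_device_train_batch_size=8,  # 8 x 4 GPUs = 32 samples of 2048 tokens (~65K tokens/step)
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,              # assumed reading of "cosine annealing ratio of 3%"
    bf16=True,
    fsdp="full_shard auto_wrap",    # FSDP distributed training, as stated in the paper
)
```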