BiScope: AI-generated Text Detection by Checking Memorization of Preceding Tokens
Authors: Hanxi Guo, Siyuan Cheng, Xiaolong Jin, Zhuo Zhang, Kaiyuan Zhang, Guanhong Tao, Guangyu Shen, Xiangyu Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our system, named BISCOPE, on texts generated by the five latest commercial LLMs across five heterogeneous datasets, including both natural language and code. BISCOPE demonstrates superior detection accuracy and robustness compared to nine existing baseline methods, exceeding the state-of-the-art non-commercial methods' detection accuracy by over 0.30 F1 score and achieving over 0.95 detection F1 score on average. |
| Researcher Affiliation | Academia | Hanxi Guo (Purdue University, guo778@purdue.edu); Siyuan Cheng (Purdue University, cheng535@purdue.edu); Xiaolong Jin (Purdue University, jin509@purdue.edu); Zhuo Zhang (Purdue University, zhan3299@purdue.edu); Kaiyuan Zhang (Purdue University, zhan4057@purdue.edu); Guanhong Tao (University of Utah, guanhong.tao@utah.edu); Guangyu Shen (Purdue University, shen447@purdue.edu); Xiangyu Zhang (Purdue University, xyzhang@cs.purdue.edu) |
| Pseudocode | No | The paper describes the methodology in text but does not include formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Code is available at https://github.com/MarkGHX/BiScope. |
| Open Datasets | Yes | We use five datasets in our evaluation, including two short natural language datasets (Arxiv [32] and Yelp [32]), two long natural language datasets (Creative [48] and Essay [48]), and one code dataset [8]. |
| Dataset Splits | Yes | For the in-distribution setting, we report a 5-fold cross-validation F1 score using one piece of human data and one piece of AI-generated data from the five latest commercial LLMs, as shown in Table 1. |
| Hardware Specification | No | The paper mentions 'computational resources' but does not specify the CPU or GPU models, memory, or other hardware details used to run the experiments. |
| Software Dependencies | No | The paper lists various LLM models used (e.g., Gemma-2B, Llama-2-7B, GPT-Neo-2.7B) and refers to 'officially released detection model' for baselines, but it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We present more hyper-parameter settings in Appendix A. For our BISCOPE, we utilize six open-source LLMs in parallel, including Gemma-2B, Gemma-7B, Llama-2-7B, Mistral-7B, Llama-3-8B, and Llama-2-13B. For the text split method, we recommend splitting the text at every 10% length, as used in our paper. |
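The 5-fold cross-validation F1 protocol quoted in the Dataset Splits row can be sketched with a small stdlib-only toy. The one-feature threshold "detector" and the synthetic scores below are placeholders for illustration, not the paper's actual classifier or data:

```python
# Minimal sketch of k-fold cross-validated F1, assuming contiguous folds and a
# trivial mean-threshold "detector". Label 0 = human-written, 1 = AI-generated.

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def k_fold_f1(xs, ys, k=5):
    """Average F1 over k folds; 'training' just picks the mean as threshold."""
    n = len(xs)
    fold_scores = []
    for i in range(k):
        test_idx = set(range(i * n // k, (i + 1) * n // k))
        train_x = [x for j, x in enumerate(xs) if j not in test_idx]
        threshold = sum(train_x) / len(train_x)
        y_true = [ys[j] for j in sorted(test_idx)]
        y_pred = [1 if xs[j] > threshold else 0 for j in sorted(test_idx)]
        fold_scores.append(f1_score(y_true, y_pred))
    return sum(fold_scores) / k

# Synthetic detector scores: AI-generated texts score higher than human ones.
xs = [0.2, 0.3, 0.1, 0.25, 0.15, 0.8, 0.9, 0.7, 0.85, 0.75] * 5
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] * 5
print(round(k_fold_f1(xs, ys, k=5), 3))  # → 1.0 on this separable toy data
```

A real reproduction would substitute the paper's detection features and classifier for the threshold rule, but the fold/F1 bookkeeping is the same.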
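The "splitting the text at every 10% length" setting quoted in the Experiment Setup row can be sketched as follows; the function names and boundary handling are illustrative assumptions, not code from the BiScope repository:

```python
def split_points(text: str, parts: int = 10) -> list[int]:
    """Candidate split indices at every 1/parts fraction of the text.

    With parts=10 this places splits at 10%, 20%, ..., 90% of the length,
    matching the 'every 10% length' setting quoted above.
    """
    n = len(text)
    return [n * i // parts for i in range(1, parts)]

def split_prefix_suffix(text: str, idx: int) -> tuple[str, str]:
    """Divide the text into the preceding context and the remaining portion."""
    return text[:idx], text[idx:]

sample = "x" * 100
print(split_points(sample))  # → [10, 20, 30, 40, 50, 60, 70, 80, 90]
```

In the paper's setting, each prefix/suffix pair would then be scored by the open-source LLMs listed above to check memorization of preceding tokens; this sketch only shows the splitting step.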