BiScope: AI-generated Text Detection by Checking Memorization of Preceding Tokens
Authors: Hanxi Guo, Siyuan Cheng, Xiaolong Jin, Zhuo Zhang, Kaiyuan Zhang, Guanhong Tao, Guangyu Shen, Xiangyu Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our system, named BISCOPE, on texts generated by the five latest commercial LLMs across five heterogeneous datasets, including both natural language and code. BISCOPE demonstrates superior detection accuracy and robustness compared to nine existing baseline methods, exceeding the state-of-the-art non-commercial methods' detection accuracy by over 0.30 F1 score and achieving over 0.95 detection F1 score on average. |
| Researcher Affiliation | Academia | Hanxi Guo (Purdue University, guo778@purdue.edu); Siyuan Cheng (Purdue University, cheng535@purdue.edu); Xiaolong Jin (Purdue University, jin509@purdue.edu); Zhuo Zhang (Purdue University, zhan3299@purdue.edu); Kaiyuan Zhang (Purdue University, zhan4057@purdue.edu); Guanhong Tao (University of Utah, guanhong.tao@utah.edu); Guangyu Shen (Purdue University, shen447@purdue.edu); Xiangyu Zhang (Purdue University, xyzhang@cs.purdue.edu) |
| Pseudocode | No | The paper describes the methodology in text but does not include formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Code is available at https://github.com/MarkGHX/BiScope. |
| Open Datasets | Yes | We use five datasets in our evaluation, including two short natural language datasets (Arxiv [32] and Yelp [32]), two long natural language datasets (Creative [48] and Essay [48]), and one code dataset [8]. |
| Dataset Splits | Yes | For the in-distribution setting, we report a 5-fold cross-validation F1 score using one piece of human data and one piece of AI-generated data from the five latest commercial LLMs, as shown in Table 1. |
| Hardware Specification | No | The paper mentions 'computational resources' but does not specify the CPU or GPU models, memory, or other hardware details used to run the experiments. |
| Software Dependencies | No | The paper lists various LLM models used (e.g., Gemma-2B, Llama-2-7B, GPT-Neo-2.7B) and refers to 'officially released detection model' for baselines, but it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We present more hyper-parameter settings in Appendix A. For our BISCOPE, we utilize six open-source LLMs in parallel, including Gemma-2B, Gemma-7B, Llama-2-7B, Mistral-7B, Llama-3-8B, and Llama-2-13B. For the text split method, we recommend splitting the text at every 10% length, as used in our paper. |
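The 5-fold cross-validation F1 protocol quoted in the Dataset Splits row can be sketched with a small stdlib-only toy. The one-feature threshold "detector" and the synthetic scores below are placeholders for illustration, not the paper's actual classifier or data:

```python
# Minimal sketch of k-fold cross-validated F1, assuming contiguous folds and a
# trivial mean-threshold "detector". Label 0 = human-written, 1 = AI-generated.

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def k_fold_f1(xs, ys, k=5):
    """Average F1 over k folds; 'training' just picks the mean as threshold."""
    n = len(xs)
    fold_scores = []
    for i in range(k):
        test_idx = set(range(i * n // k, (i + 1) * n // k))
        train_x = [x for j, x in enumerate(xs) if j not in test_idx]
        threshold = sum(train_x) / len(train_x)
        y_true = [ys[j] for j in sorted(test_idx)]
        y_pred = [1 if xs[j] > threshold else 0 for j in sorted(test_idx)]
        fold_scores.append(f1_score(y_true, y_pred))
    return sum(fold_scores) / k

# Synthetic detector scores: AI-generated texts score higher than human ones.
xs = [0.2, 0.3, 0.1, 0.25, 0.15, 0.8, 0.9, 0.7, 0.85, 0.75] * 5
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] * 5
print(round(k_fold_f1(xs, ys, k=5), 3))  # → 1.0 on this separable toy data
```

A real reproduction would substitute the paper's detection features and classifier for the threshold rule, but the fold/F1 bookkeeping is the same.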
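The "splitting the text at every 10% length" setting quoted in the Experiment Setup row can be sketched as follows; the function names and boundary handling are illustrative assumptions, not code from the BiScope repository:

```python
def split_points(text: str, parts: int = 10) -> list[int]:
    """Candidate split indices at every 1/parts fraction of the text.

    With parts=10 this places splits at 10%, 20%, ..., 90% of the length,
    matching the 'every 10% length' setting quoted above.
    """
    n = len(text)
    return [n * i // parts for i in range(1, parts)]

def split_prefix_suffix(text: str, idx: int) -> tuple[str, str]:
    """Divide the text into the preceding context and the remaining portion."""
    return text[:idx], text[idx:]

sample = "x" * 100
print(split_points(sample))  # → [10, 20, 30, 40, 50, 60, 70, 80, 90]
```

In the paper's setting, each prefix/suffix pair would then be scored by the open-source LLMs listed above to check memorization of preceding tokens; this sketch only shows the splitting step.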