Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm

Authors: Xiaowei Zhu, Yubing Ren, Fang Fang, Qingfeng Tan, Shi Wang, Yanan Cao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations demonstrate that our method achieves state-of-the-art detection performance and exhibits strong robustness against various adversarial attacks and input lengths. Specifically, DNA-DetectLLM achieves relative improvements of 5.55% in AUROC and 2.08% in F1 score across multiple public benchmark datasets.
Researcher Affiliation | Academia | 1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; 2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; 3 University International College, Macau University of Science and Technology, Macau; 4 Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China; 5 Institute of Computing Science, Chinese Academy of Sciences, Beijing, China. EMAIL
Pseudocode | No | The paper describes the workflow in 3 key steps and illustrates it with Figure 2, but it does not include a clearly labeled pseudocode or algorithm block with structured steps. The steps are described in prose.
Open Source Code | Yes | Code and data are available at https://github.com/Xiaoweizhu57/DNA-DetectLLM. All code, datasets, and models are well-documented and openly available through access in the supplementary material.
Open Datasets | Yes | To evaluate performance across diverse domains, we collect 4,800 human-written texts from three representative tasks: news article writing (XSum [23]), story generation (WritingPrompts [8]), and academic writing (Arxiv [24]). We further sample 2,000 balanced examples from each of three high-quality detection benchmarks M4 [33], DetectRL [35], and RealDet [41] to ensure fair and comprehensive evaluation across real-world scenarios.
Dataset Splits | Yes | For training-based methods, we exclusively train on the HC3 dataset [10], which is entirely disjoint from the test sets. For training-based methods (Biscope and R-Detect), models are trained on 4,000 balanced samples from the HC3 dataset, and the best-performing checkpoints are selected based on validation performance on a separate 2,000-sample validation set.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A100 GPU with 80GB of memory.
Software Dependencies | Yes | We standardize the reference (or scoring) model across all methods by employing Falcon-7B-Instruct [25] to compute token generation probabilities. Moreover, Fast-DetectGPT, Binoculars, Lastde++, and DNA-DetectLLM utilize Falcon-7B [25] as the observer (or sampling) model, while DetectGPT uses T5-3B [26]. GPT-4 Turbo: gpt-4-turbo-2024-04-09, Temperature = 1.0, Top-p = 1.0. Gemini 2.0 Flash: gemini-2.0-flash-001, Temperature = 1.0, Top-p = 0.95. Claude 3.7 Sonnet: claude-3-7-sonnet@20250219, Temperature = 1.0, Top-p = 1.0.
Experiment Setup | Yes | During testing, the maximum input token length is capped at 1024. Default settings are used for temperature, top-k, and other generation parameters. No additional hyperparameter tuning is involved in this study. For training-based methods (Biscope and R-Detect), models are trained on 4,000 balanced samples from the HC3 dataset, and the best-performing checkpoints are selected based on validation performance on a separate 2,000-sample validation set. GPT-4 Turbo: gpt-4-turbo-2024-04-09, Temperature = 1.0, Top-p = 1.0. Gemini 2.0 Flash: gemini-2.0-flash-001, Temperature = 1.0, Top-p = 0.95. Claude 3.7 Sonnet: claude-3-7-sonnet@20250219, Temperature = 1.0, Top-p = 1.0.
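
For readers who want to carry the reported setup into their own scripts, the settings quoted in the Experiment Setup entry can be written down as a small configuration. The following is only a minimal sketch, not the authors' code: the names `GeneratorConfig`, `GENERATORS`, `HC3_SPLIT`, and `truncate_tokens` are hypothetical, and keeping the leading tokens when truncating is an assumption, since the source states only that input length is capped at 1024.

```python
from dataclasses import dataclass

# Reported cap on input length at test time (from the Experiment Setup entry).
MAX_INPUT_TOKENS = 1024


@dataclass(frozen=True)
class GeneratorConfig:
    """Hypothetical container for one text generator's reported settings."""
    model_id: str
    temperature: float
    top_p: float


# Model identifiers and sampling parameters exactly as quoted in the record.
GENERATORS = {
    "GPT-4 Turbo": GeneratorConfig("gpt-4-turbo-2024-04-09", 1.0, 1.0),
    "Gemini 2.0 Flash": GeneratorConfig("gemini-2.0-flash-001", 1.0, 0.95),
    "Claude 3.7 Sonnet": GeneratorConfig("claude-3-7-sonnet@20250219", 1.0, 1.0),
}

# Reported HC3 sample counts for the training-based baselines.
HC3_SPLIT = {"train": 4000, "validation": 2000}


def truncate_tokens(token_ids, max_len=MAX_INPUT_TOKENS):
    """Return at most max_len tokens.

    Assumption: the leading tokens are kept; the source does not say
    which end of an over-long input is dropped.
    """
    return list(token_ids[:max_len])
```

A call such as `truncate_tokens(ids)` leaves inputs of 1024 tokens or fewer unchanged and shortens longer ones to the cap.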