Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

Authors: Hongyi Zhou, Jin Zhu, Pingfan Su, Kai Ye, Ying Yang, Shakeel Gavioli-Akilagun, Chengchun Shi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive numerical studies show AdaDetectGPT nearly uniformly improves the state-of-the-art method in various combination of datasets and LLMs, and the improvement can reach up to 37%.
Researcher Affiliation	Academia	Hongyi Zhou, Department of Mathematics, Tsinghua University, Beijing, China; Jin Zhu, School of Mathematics, University of Birmingham, Birmingham, UK; Pingfan Su, Department of Statistics, LSE, London, UK; Kai Ye, Department of Statistics, LSE, London, UK; Ying Yang, Department of Statistics and Data Science, Tsinghua University, Beijing, China; Shakeel A O B Gavioli-Akilagun, Department of Decision Analytics and Operations, City University Hong Kong, Hong Kong, China; Chengchun Shi, Department of Statistics, LSE, London, UK
Pseudocode	No	The paper describes the method and its components through mathematical formulations and textual explanations but does not include a dedicated pseudocode block or algorithm listing.
Open Source Code	Yes	A python implementation of our method is available at https://github.com/Mamba413/AdaDetectGPT.
Open Datasets	Yes	We consider five widely-used datasets for comparing different detectors, including SQuAD for Wikipedia-style question answering (Rajpurkar et al., 2016), WritingPrompts for story generation (Fan et al., 2018), XSum for news summarization (Narayan et al., 2018), Yelp for crowd-sourced product reviews (Zhang et al., 2015), and Essay for high school and university-level essays (Verma et al., 2024).
Dataset Splits	Yes	Following Bao et al. (2024), we randomly sample 500 human-written paragraphs from each dataset and generate an equal number of machine-authored paragraphs by prompting an LLM with the first 120 tokens of the human-written text and requiring it to complete the text with up to 200 tokens. This is a challenging setting where LLM-generated text is mixed with human writing. To evaluate AdaDetectGPT, we compute the AUC on each of the five datasets, with its witness function ŵ trained on two randomly selected datasets that differ from the test dataset.
Hardware Specification	Yes	Most of experiments are run on a Tesla A100 GPU (40GB) with 10 vCPU Intel Xeon Processor and 72GB RAM. For the experiments where the source model is GPT-NeoX, we run on a H20-NVLink (96GB) GPU with 20 vCPU Intel(R) Xeon(R) Platinum and 200GB RAM.
Software Dependencies	No	The paper mentions 'python implementation' and uses 'torch.float16' and 'torch.float32', implying the use of Python and PyTorch. However, it does not specify explicit version numbers for Python, PyTorch, or any other key software dependencies.
Experiment Setup	Yes	To evaluate AdaDetectGPT, we compute the AUC on each of the five datasets, with its witness function ŵ trained on two randomly selected datasets that differ from the test dataset. ... For all closed-source models, the temperature parameter is set to 0.8 to encourage the generated text to be creatively diverse and less predictable. ... We run DetectGPT and NPR with default 100 perturbations with the T5 model (Raffel et al., 2020) and run DNA-GPT with a truncate-ratio of 0.5 and 10 prefix completions per passage. ... B-spline relies on two critical tuning parameters: (i) the number of basis functions (n_base) and (ii) the maximum polynomial order. Our experiments fix one parameter while varying the other (with n_base=16 or order=2 as defaults).
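
Illustrative note: the evaluation protocol quoted under Dataset Splits and Experiment Setup (train the witness function on two randomly chosen datasets other than the test dataset, then report AUC on the test dataset) can be sketched roughly as follows. This is a minimal sketch and not the authors' code. The function names, the random seed, and the assumption that a higher detector score means "more likely machine-generated" are all hypothetical choices made for illustration; the released repository may organize this differently.

```python
import random

# Dataset names as listed in the Open Datasets row.
DATASETS = ["SQuAD", "WritingPrompts", "XSum", "Yelp", "Essay"]


def training_datasets(test_name, rng, k=2):
    """Pick k datasets, all different from the test dataset (hypothetical helper)."""
    candidates = [d for d in DATASETS if d != test_name]
    return rng.sample(candidates, k)


def auc(human_scores, machine_scores):
    """AUC as the fraction of (machine, human) pairs ranked correctly.

    Assumes higher scores indicate machine-generated text; ties count as 1/2.
    """
    wins = 0.0
    for m in machine_scores:
        for h in human_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(machine_scores) * len(human_scores))


# One train/test plan per test dataset; the seed is an arbitrary assumption.
rng = random.Random(0)
plan = {test: training_datasets(test, rng) for test in DATASETS}
```

With 500 human and 500 machine paragraphs per dataset, the pairwise loop in `auc` compares 250,000 pairs, which is still cheap; a rank-based implementation would be the usual choice for larger samples.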