Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection

Authors: Cong Zeng, Shengkun Tang, Yuanzhou Chen, Zhiqiang Shen, Wenchao Yu, Xujiang Zhao, Haifeng Chen, Wei Cheng, Zhiqiang Xu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on the Deep Fake dataset.
Researcher Affiliation	Collaboration	Cong Zeng (MBZUAI), Shengkun Tang (MBZUAI), Yuanzhou Chen (UCLA), Zhiqiang Shen (MBZUAI), Wenchao Yu (NEC Lab), Xujiang Zhao (NEC Lab), Haifeng Chen (NEC Lab), Wei Cheng (NEC Lab), Zhiqiang Xu (MBZUAI)
Pseudocode	No	The paper describes the methods and overall pipeline using text, equations, and figures (e.g., Figure 2), but does not present any formal pseudocode blocks or algorithm listings.
Open Source Code	Yes	Code, pretrained weights, and demo will be released openly at https://github.com/cong-zeng/ood-llm-detect.
Open Datasets	Yes	Datasets. We test our method on three widely-used and challenging datasets including Deep Fake [71], M4 [72], and RAID [73].
Dataset Splits	Yes	The Deepfake dataset comprises text generated by 27 large language models (LLMs) alongside human-written content sourced from multiple websites across 10 distinct domains, totaling 332K training and 57K test samples. The multilingual setting includes 157K training and 42K test samples. We hold out 10% of the training data as a validation set for evaluation.
Hardware Specification	Yes	All experiments are conducted on 2*A100 GPUs.
Software Dependencies	No	The paper mentions using 'Adam [75]' as the optimizer and refers to 'official implementation of each baseline' and 'pre-trained weights of DeTeCtive'. However, specific version numbers for software components like Python, PyTorch, or other libraries are not provided.
Experiment Setup	Yes	The learning rate is set to 2e-5 and the optimizer is Adam [75] with β1 = 0.9 and β2 = 0.98. The loss weights α and β are set to 1 in our experiments. We train all models for 20 epochs. All experiments are conducted on 2*A100 GPUs. For Deep SVDD, we use the machine texts from the training set to compute the initial center and compute the corresponding loss with the center. We freeze the parameters of the center point and disable the optimization on the center point. For HRN, we follow its original setting by training a one-class classifier per model family using its corresponding data as the positive class and averaging their scores at inference time. We also follow the original hyperparameter settings, where λ = 0.1 and n = 12. For the energy-based method, a classification head is attached following the backbone model. We follow [18] to choose hyper-parameters, where m_in = 27 and m_out = 5.
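
The Deep SVDD setup quoted in the Experiment Setup row (a center computed from machine texts, then frozen, with the loss measured against it) can be illustrated with a minimal sketch. This is not the authors' code. It assumes the center is the plain mean of the machine-text embeddings and that the loss is the mean squared distance to that center, which is the common Deep SVDD convention but is not stated in the quoted text. The function names `compute_center`, `squared_distance`, and `svdd_loss`, and the toy 2-D "embeddings", are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def compute_center(embeddings: Sequence[Vector]) -> List[float]:
    """Mean of the in-distribution (machine-text) embeddings.

    Assumption: the center is a simple mean; it is computed once and
    never updated afterwards (i.e. it is "frozen").
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def squared_distance(x: Vector, center: Vector) -> float:
    """Squared Euclidean distance to the center; larger means more OOD-like."""
    return sum((a - b) ** 2 for a, b in zip(x, center))


def svdd_loss(batch: Sequence[Vector], center: Vector) -> float:
    """Mean squared distance of a batch to the fixed center."""
    return sum(squared_distance(x, center) for x in batch) / len(batch)


# Toy 2-D stand-ins for encoder outputs (hypothetical values).
machine_embeddings = [[1.0, 1.0], [1.0, 3.0], [3.0, 1.0], [3.0, 3.0]]

center = compute_center(machine_embeddings)          # fixed from here on
loss = svdd_loss(machine_embeddings, center)         # training objective on machine texts
human_score = squared_distance([6.0, 5.0], center)   # far from the center => flagged as an outlier
```

In a real training loop only the encoder producing the embeddings would be optimized. The center stays a constant, which is what freezing its parameters amounts to.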