Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Training-free LLM-generated Text Detection by Mining Token Probability Sequences
Authors: Yihuai Xu, Yongwei Wang, YIFEI BI, Huangsen Cao, Zhouhan Lin, Yu Zhao, Fei Wu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on six datasets involving cross-domain, cross-model, and cross-lingual detection scenarios, under both white-box and black-box settings, demonstrated that our method consistently achieves state-of-the-art performance. |
| Researcher Affiliation | Academia | (1) Zhejiang University; (2) Georgia Institute of Technology; (3) Shanghai Jiao Tong University; (4) Zhejiang Gongshang University |
| Pseudocode | No | The paper describes the methodology with mathematical formulations and a framework diagram (Figure 2), and outlines the detection process in three steps. However, it does not present a dedicated section or block formatted as pseudocode or a clear algorithm. |
| Open Source Code | Yes | The code and data are released at https://github.com/TrustMedia-zju/Lastde_Detector. |
| Open Datasets | Yes | The experiments conducted involved 6 distinct datasets, covering a range of languages and topics. Adhering to the setups of Fast-DetectGPT and DNA-GPT, we report the main detection results on 4 datasets: XSum (Narayan et al., 2018) (BBC News documents), SQuAD (Rajpurkar et al., 2016; 2018) (Wikipedia-based Q&A context), WritingPrompts (Fan et al., 2018) (for story generation), and Reddit ELI5 (Fan et al., 2019) (Q&A data restricted to the topics of biology, physics, chemistry, economics, law, and technique). |
| Dataset Splits | Yes | We prefer the latter approach and have fitted logistic regression models on datasets (including XSum, WritingPrompts, Reddit) generated by two closed-source models (GPT-4-Turbo, GPT-4o) and one open-source model (OPT-13B), reporting metrics on the test set (test size=0.2). |
| Hardware Specification | Yes | Our experimental setup consists of two RTX 3090 GPUs (2 × 24GB). |
| Software Dependencies | No | The paper lists various LLMs used as source and proxy models, with references to their technical reports or versions (e.g., GPT-4 (OpenAI, 2024b), Gemma (Team et al., 2024), GPT-J (Wang & Komatsuzaki, 2021)). However, it does not provide specific version numbers for general software dependencies such as programming languages (e.g., Python) or libraries (e.g., PyTorch, TensorFlow, Hugging Face Transformers) used for implementing their methodology. |
| Experiment Setup | Yes | Furthermore, for Lastde, the 3 hyperparameters are set to default values of s = 3, ε = 10 × n, τ = 5, where n is the number of tokens in the text. ... For Lastde++, the default settings are s = 4, ε = 8 × n, τ = 15. |
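
The Dataset Splits row describes fitting a logistic regression model on detection scores and reporting metrics on a held-out test set of size 0.2. The sketch below illustrates that general procedure using only the Python standard library. It is not the authors' code: the synthetic scores, the assumption that LLM-generated text receives the higher score, and the helper names (`train_test_split`, `fit_logistic`, `accuracy`) are all hypothetical and chosen for illustration.

```python
import math
import random


def train_test_split(items, test_size=0.2, seed=0):
    """Shuffle a copy of items and hold out a test_size fraction."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(train, lr=0.1, epochs=500):
    """Fit a one-feature logistic regression by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(train)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in train:
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(data, w, b):
    """Fraction of examples classified correctly at a 0.5 threshold."""
    correct = sum((sigmoid(w * x + b) >= 0.5) == bool(y) for x, y in data)
    return correct / len(data)


# Hypothetical detection scores: label 0 = human-written, 1 = LLM-generated.
# Real scores would come from a detector; these are synthetic placeholders.
rng = random.Random(42)
data = [(rng.gauss(-1.0, 0.5), 0) for _ in range(100)]
data += [(rng.gauss(1.0, 0.5), 1) for _ in range(100)]

train, test = train_test_split(data, test_size=0.2)
w, b = fit_logistic(train)
test_acc = accuracy(test, w, b)
print(f"train={len(train)} test={len(test)} test accuracy={test_acc:.3f}")
```

In practice a library routine such as scikit-learn's `train_test_split` and `LogisticRegression` would likely replace the hand-written helpers; the standard-library version is used here only to keep the sketch self-contained.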