Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Advancing Machine-Generated Text Detection from an Easy to Hard Supervision Perspective

Authors: Chenwang Wu, Yiu-ming Cheung, Bo Han, Defu Lian

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across diverse practical scenarios, including cross-LLM, cross-domain, mixed text, and paraphrase attacks, demonstrate the framework's significant detection effectiveness. We conduct extensive evaluations in various practical scenarios [39], sentence-level, paragraph-level, mixed, paraphrasing, and cross-domain texts.
Researcher Affiliation	Academia	¹Department of Computer Science, Hong Kong Baptist University, Hong Kong, China ²School of Computer Science, University of Science and Technology of China, Hefei, China EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 Easy to Hard Supervision Framework
Open Source Code	Yes	The code is available at: https://github.com/tmlr-group/Easy2Hard. The source files are publicly available at https://github.com/tmlr-group/Easy2Hard.
Open Datasets	Yes	We conduct experiments on two public datasets, Essay [16] and DetectRL [32], to validate our effectiveness. The Essay dataset comprises MGTs generated by GPT4All, ChatGPT, ChatGPT-turbo, ChatGLM, Dolly, and Claude. The DetectRL dataset includes MGTs from PaLM, ChatGPT, Claude, and Llama-2.
Dataset Splits	Yes	For the DetectRL and Essay datasets, we randomly selected 10% of the data as the training set, with the remaining 90% evenly divided into validation and test sets.
Hardware Specification	No	The paper provides running time metrics in Table 3 and Table 36, but does not specify any particular hardware (e.g., GPU models, CPU models, or specific cluster configurations) used for the experiments. It only vaguely refers to 'compute workers' in the NeurIPS checklist without providing details.
Software Dependencies	No	The paper does not provide specific version numbers for any software dependencies or libraries used for the implementation (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup	Yes	Specifically, all models were fine-tuned for 5 epochs, with a batch size set to 32. Regarding learning rates, we set 5e-6 for relatively smaller models like ChatGPT-D and MPU. For the larger RADAR model, we found that a learning rate of 5e-6 led to unstable training, so a smaller learning rate of 1e-6 was chosen. For supervisor-related hyperparameters, the default settings are as follows: the number of texts in longer texts (k = 3), the number of longer texts per batch (N = 128), and the weight (λ = 10).
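
The Dataset Splits and Experiment Setup rows above can be illustrated with a minimal sketch. It is not taken from the authors' repository. The names `split_indices` and `FineTuneConfig`, the field names, and the fixed `seed` are hypothetical, chosen only to show how a 10% / 45% / 45% random split and the reported hyperparameters might be written down.

```python
import random
from dataclasses import dataclass


def split_indices(n, train_frac=0.10, seed=0):
    """Randomly split range(n) into train/validation/test index lists.

    train_frac of the data goes to training; the remainder is divided
    evenly between validation and test (test gets the extra item if the
    remainder is odd). The seed is an assumption made for repeatability.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train_frac)
    n_val = (n - n_train) // 2
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the hyperparameters quoted above."""
    epochs: int = 5
    batch_size: int = 32
    learning_rate: float = 5e-6  # reported for smaller models (ChatGPT-D, MPU)
    k: int = 3                   # number of texts in each longer text
    n_longer_texts: int = 128    # N: longer texts per batch
    lam: float = 10.0            # lambda: the weight


# The larger RADAR model is reported to use a smaller learning rate.
RADAR_CONFIG = FineTuneConfig(learning_rate=1e-6)

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
```

Because the paper reports neither a random seed nor software versions, a split produced this way would match the reported proportions but not necessarily the authors' exact partition.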