Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning
Authors: Xun Guo, Yongxin He, Shan Zhang, Ting Zhang, Wanquan Feng, Haibin Huang, Chongyang Ma
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method enhances the ability of various text encoders in detecting AI-generated text across multiple benchmarks and achieves state-of-the-art results. |
| Researcher Affiliation | Collaboration | Xun Guo¹, Shan Zhang², Yongxin He², Ting Zhang², Wanquan Feng¹, Haibin Huang¹, Chongyang Ma¹; ¹ByteDance, ²University of Chinese Academy of Sciences |
| Pseudocode | No | The paper describes the steps of the proposed method (e.g., multi-level contrastive learning, dense information retrieval pipeline) but does not present them in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/heyongxin233/DeTeCtive |
| Open Datasets | Yes | In this study, we employ three widely-used and challenging datasets to evaluate our proposed method. The Deepfake [39] dataset includes text generated by 27 different LLMs and human-written content from multiple websites across 10 domains, encompassing 332K training and 57K test data. It also outlines six diverse testing scenarios, covering an array of settings from domain-specific to cross-domains, and out-of-distribution detection scenarios. The M4 [68] dataset is a multi-domain, multi-model, and multi-language dataset encompassing data from 8 LLMs, 6 domains, and 9 languages. [...] Finally, we make use of the TuringBench [61] dataset. |
| Dataset Splits | Yes | The Deepfake [39] dataset includes text generated by 27 different LLMs and human-written content from multiple websites across 10 domains, encompassing 332K training and 57K test data. |
| Hardware Specification | Yes | We train for 50 epochs with batch size of 32 per GPU on 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | For all our method's experiments, we use the interfaces and pre-trained model weights from the Hugging Face transformers [28] library. [...] During inference, we implement with an efficient K-Nearest Neighbors (KNN) [15] algorithm provided by the Faiss [46] library, to perform classification. |
| Experiment Setup | Yes | All experiments use the same hyperparameters and an AdamW [44] optimizer with a cosine annealing learning rate schedule. The peak learning rate is set at 2e-05, warmed up linearly for 2000 steps, and weight decay is set to 1e-04. The maximum input token length is 512. We train for 50 epochs with batch size of 32 per GPU on 8 NVIDIA V100 GPUs. |
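
The learning rate schedule quoted in the Experiment Setup row (linear warmup for 2000 steps to a peak of 2e-05, followed by cosine annealing) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `learning_rate`, the `total_steps` argument, the warmup starting from zero, and the decay floor `min_lr=0.0` are all assumptions, since the quoted text states none of them.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 2e-5
WARMUP_STEPS = 2000


def learning_rate(step, total_steps, peak_lr=PEAK_LR,
                  warmup_steps=WARMUP_STEPS, min_lr=0.0):
    """Return the learning rate at a given optimizer step.

    Assumptions (not stated in the source): warmup starts at 0,
    the cosine phase decays to `min_lr`, and `total_steps` is
    supplied by the caller.
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine annealing from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical run length, chosen only for illustration.
    total = 10_000
    for s in (0, 1000, 2000, 6000, 10_000):
        print(s, learning_rate(s, total))
```

Under these assumptions the rate rises linearly to 2e-05 at step 2000 and then falls along a half cosine, reaching the floor at the final step.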