Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DALD: Improving Logits-based Detector without Logits from Black-box LLMs
Authors: Cong Zeng, Shengkun Tang, Xianjun Yang, Yuanzhou Chen, Yiyou Sun, Zhiqiang Xu, Yao Li, Haifeng Chen, Wei Cheng, Dongkuan (DK) Xu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate that our methodology reliably secures high detection precision for LLM-generated text and effectively detects text from diverse model origins through a singular detector. Our approach performs SOTA in black-box settings on different advanced closed-source and open-source models. |
| Researcher Affiliation | Collaboration | MBZUAI; University of California, Santa Barbara; University of California, Los Angeles; NEC Labs America; University of North Carolina, Chapel Hill; NC State University |
| Pseudocode | No | The paper describes the methodology and process in prose and uses figures (e.g., Figure 3 for framework overview) and equations, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and data are released at https://github.com/cong-zeng/DALD |
| Open Datasets | Yes | We follow Fast-DetectGPT using four datasets in the black-box detection evaluation, including XSum [52], WritingPrompts [53], WMT-2016 [54] and PubMedQA [55]. Our training datasets are collected from the open-source datasets, WildChat [59] for GPT-3.5 and GPT-4. |
| Dataset Splits | No | The paper specifies training and test sets but does not explicitly mention a separate validation set or details about its use for hyperparameter tuning or early stopping. It states, 'We do not tune the hyperparameters carefully.' |
| Hardware Specification | Yes | For training time, our method finetunes Llama-2-7B with 5K samples on 4 A6000. |
| Software Dependencies | No | The paper mentions using PyTorch and Low-Rank Adaptation (LoRA) but does not provide specific version numbers for these software components or any other key libraries. |
| Experiment Setup | Yes | For LoRA hyper-parameters, we utilize 16 as the LoRA rank and set lora_alpha as 32. Dropout is set as 0.05. For training hyperparameters, we set 512 as the max length for texts from GPT-4 and GPT-3.5 models while it is 2048 for texts from Claude-3. We finetune the surrogate model with a learning rate of 1e-4. The batch size is set as 1 per device with gradient accumulation per 4 steps. |
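
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which makes the implied effective batch size easy to check. This is a minimal sketch, not the authors' code: the class name `SurrogateFinetuneConfig` and its field names are hypothetical, and `num_devices = 4` assumes that all four A6000 GPUs from the Hardware Specification row are used in data-parallel training, which the quoted text does not state explicitly.

```python
from dataclasses import dataclass

# Max text length per source model, as quoted in the Experiment Setup row.
# The dictionary keys are illustrative labels, not identifiers from the paper.
MAX_LENGTH_BY_SOURCE = {"gpt-3.5": 512, "gpt-4": 512, "claude-3": 2048}


@dataclass(frozen=True)
class SurrogateFinetuneConfig:
    """Hypothetical container for the quoted fine-tuning hyperparameters."""

    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    learning_rate: float = 1e-4
    per_device_batch_size: int = 1
    gradient_accumulation_steps: int = 4
    num_devices: int = 4  # assumption: all 4 A6000 GPUs run data-parallel

    def max_length(self, source_model: str) -> int:
        """Return the max text length for a source model label."""
        return MAX_LENGTH_BY_SOURCE[source_model]

    @property
    def effective_batch_size(self) -> int:
        """Samples contributing to each optimizer step."""
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * self.num_devices
        )

    @property
    def lora_scaling(self) -> float:
        """Standard LoRA scales the low-rank update by alpha / rank."""
        return self.lora_alpha / self.lora_rank


config = SurrogateFinetuneConfig()
print(config.effective_batch_size)    # 16 under the data-parallel assumption
print(config.lora_scaling)            # 2.0
print(config.max_length("claude-3"))  # 2048
```

If training instead ran on a single device, the effective batch size would be 4; the paper's released code is the authoritative source for the actual launch configuration.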