Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text

Authors: Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, Soheil Feizi

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to simple paraphrasing attack which, ironically, increases the true positive at 1% false positive (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT, adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors, including neural network-based, watermark-based, and zero-shot approaches, our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large.
Researcher Affiliation | Academia | Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, Soheil Feizi; University of Maryland, College Park
Pseudocode | Yes | Algorithm 1: Adversarial Paraphrasing with Guidance for Universal Humanization of AI Texts
Open Source Code | Yes | Project: https://github.com/chengez/Adversarial-Paraphrasing
Open Datasets | Yes | For non-watermark-based detectors, we use MAGE [18] as our primary evaluation dataset due to its rich diversity of text sources.
Dataset Splits | Yes | We randomly sample 2000 AI-generated texts and 2000 human-written texts from MAGE while ensuring that each text is 100 to 500 tokens in length. For watermark-based detectors, we construct watermarked datasets using a watermarked LLaMA-3.1-8B-Instruct [21]. Specifically, we input the model with the first 20 words of each of the 2000 AI texts as prefix, and let it generate watermarked text 200 to 600 tokens in length.
Hardware Specification | Yes | We utilize two NVIDIA RTX A6000 GPUs to host both the paraphraser language model and the guidance AI text detector.
Software Dependencies | No | The paper mentions models like "LLaMA-3-8B-Instruct [21]" and "GPT-4o [23]" as well as the "Hugging Face Transformer library" for specific components, but it does not provide specific version numbers for these or other general software dependencies (e.g., Python, PyTorch).
Experiment Setup | Yes | We use LLaMA-3-8B-Instruct [21] with a custom system prompt (see Figure 2) as our paraphraser model. During adversarial sampling, we apply top-p and top-k masking with p = 0.99 and k = 50 at each step. We ablate the guidance detector using all four neural network-based detectors considered in our study: OpenAI-RoBERTa-Large [30], OpenAI-RoBERTa-Base [30], MAGE [18], and RADAR [8].
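
The Experiment Setup entry mentions top-p and top-k masking with p = 0.99 and k = 50 at each sampling step. The following is a minimal sketch of what such a step might look like, not the authors' implementation. It assumes top-k is applied first and the nucleus (top-p) cut is taken over the renormalized top-k probabilities, and that the surviving token with the lowest detector score is then chosen. The names `top_k_top_p_candidates`, `pick_token`, and the `ai_score` callable are hypothetical, introduced only for illustration.

```python
import math
from typing import Callable, Dict, List


def top_k_top_p_candidates(logits: Dict[str, float], k: int = 50, p: float = 0.99) -> List[str]:
    """Return the tokens that survive top-k, then top-p (nucleus) masking.

    Assumption: top-k is applied first and probabilities are renormalized
    over the k kept tokens before the cumulative top-p cut.
    """
    # Keep the k highest-logit tokens, in descending order.
    ranked = sorted(logits, key=logits.get, reverse=True)[:k]

    # Softmax restricted to the top-k tokens (numerically stabilized).
    max_logit = logits[ranked[0]]
    exps = [math.exp(logits[tok] - max_logit) for tok in ranked]
    total = sum(exps)

    # Keep tokens until their cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for tok, e in zip(ranked, exps):
        kept.append(tok)
        cumulative += e / total
        if cumulative >= p:
            break
    return kept


def pick_token(prefix: str, logits: Dict[str, float],
               ai_score: Callable[[str], float],
               k: int = 50, p: float = 0.99) -> str:
    """Hypothetical guided step: choose the surviving token whose
    continuation the guidance detector scores as least AI-like."""
    candidates = top_k_top_p_candidates(logits, k, p)
    return min(candidates, key=lambda tok: ai_score(prefix + tok))


# Toy usage with made-up logits and a stand-in scorer (not a real detector).
toy_logits = {"a": 5.0, "b": 4.0, "c": 0.0, "d": -10.0}
toy_scorer = lambda text: text.count("a") / len(text)
chosen = pick_token("x", toy_logits, toy_scorer, k=3, p=0.9)
```

In a real setup the logits would come from the paraphraser model and `ai_score` from the guidance detector; the selection rule shown here is an assumption about how the guidance is used and should be checked against Algorithm 1 in the paper.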