Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text
Authors: Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, Soheil Feizi
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to simple paraphrasing attack, which, ironically, increases the true positive at 1% false positive (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT, adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors, including neural network-based, watermark-based, and zero-shot approaches, our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. |
| Researcher Affiliation | Academia | Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, Soheil Feizi; University of Maryland, College Park |
| Pseudocode | Yes | Algorithm 1 Adversarial Paraphrasing with Guidance for Universal Humanization of AI Texts |
| Open Source Code | Yes | Project: https://github.com/chengez/Adversarial-Paraphrasing |
| Open Datasets | Yes | For non-watermark-based detectors, we use MAGE [18] as our primary evaluation dataset due to its rich diversity of text sources. |
| Dataset Splits | Yes | We randomly sample 2000 AI-generated texts and 2000 human-written texts from MAGE while ensuring that each text is 100 to 500 tokens in length. For watermark-based detectors, we construct watermarked datasets using a watermarked LLaMA-3.1-8B-Instruct [21]. Specifically, we input the model with the first 20 words of each of the 2000 AI texts as prefix, and let it generate watermarked text 200 to 600 tokens in length. |
| Hardware Specification | Yes | We utilize two NVIDIA RTX A6000 GPUs to host both the paraphraser language model and the guidance AI text detector. |
| Software Dependencies | No | The paper mentions models like "LLaMA-3-8B-Instruct [21]" and "GPT-4o [23]" as well as the "Hugging Face Transformer library" for specific components, but it does not provide specific version numbers for these or other general software dependencies (e.g., Python, PyTorch). |
| Experiment Setup | Yes | We use LLaMA-3-8B-Instruct [21] with a custom system prompt (see Figure 2) as our paraphraser model. During adversarial sampling, we apply top-p and top-k masking with p = 0.99 and k = 50 at each step. We ablate the guidance detector using all four neural network-based detectors considered in our study: OpenAI-RoBERTa-Large [30], OpenAI-RoBERTa-Base [30], MAGE [18], and RADAR [8]. |
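
The Experiment Setup row mentions top-p and top-k masking with p = 0.99 and k = 50 at each sampling step. The following is a minimal sketch of what such a mask can compute, not the authors' implementation. It assumes one common convention: the k most probable tokens are kept first, and top-p (nucleus) filtering is then applied to the renormalized probabilities of those k tokens. The function names `softmax` and `top_k_top_p_mask` are hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_mask(logits, k=50, p=0.99):
    """Return a list of booleans marking which token ids stay eligible.

    Assumed convention (not confirmed by the source): apply top-k first,
    then keep the smallest prefix of the remaining tokens whose
    renormalized cumulative probability reaches p.
    """
    probs = softmax(logits)
    # Token ids sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k step: keep at most k candidates.
    kept = order[:k]
    # Renormalize over the surviving candidates before the top-p step.
    mass = sum(probs[i] for i in kept)
    mask = [False] * len(logits)
    cumulative = 0.0
    for i in kept:
        mask[i] = True
        cumulative += probs[i] / mass
        # Stop once the nucleus covers probability p.
        if cumulative >= p:
            break
    return mask
```

For example, with next-token probabilities 0.5, 0.3, 0.15, and 0.05, calling `top_k_top_p_mask` with `k=3` and `p=0.8` keeps only the first two tokens. With the values quoted in the table (`k=50`, `p=0.99`), the mask is loose by design, so most plausible tokens remain candidates at each step.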