Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis
Authors: Zhisheng Zhang, Derui Wang, Yifan Mi, Zhiyong Wu, Jie Gao, Yuxin Cao, Kai Ye, Minhui Xue, Jie Hao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard's effectiveness in timbre and pronunciation protection. Real-world deployment validation is also conducted. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/. |
| Researcher Affiliation | Academia | 1 Shenzhen International Graduate School, Tsinghua University; 2 Beijing University of Posts and Telecommunications; 3 CSIRO's Data61; 4 Responsible AI Research (RAIR) Centre, The University of Adelaide; 5 National University of Singapore; 6 The University of Hong Kong. Corresponding authors |
| Pseudocode | Yes | B Algorithm. Algorithm 1 provides a detailed illustration of each step that E2E-VGuard utilizes to protect audio. |
| Open Source Code | Yes | Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/. |
| Open Datasets | Yes | We selected both single-speaker and multi-speaker datasets in English and Chinese to verify E2E-VGuard's protection performance across different scenarios, employing LibriTTS [40] for English single-speaker evaluation following [10], CMU ARCTIC [41] for English multi-speaker testing, and THCHS30 [42] for Chinese multi-speaker assessment. |
| Dataset Splits | Yes | For each dataset, we have randomly allocated 80% for training and 20% for testing. If the model requires a validation set, we utilize 10% of the training set as the validation set. |
| Hardware Specification | Yes | All of our experiments are conducted on one NVIDIA 4090 GPU. |
| Software Dependencies | No | The paper mentions software like Wav2vec2 [23] and Whisper [16] as ASR systems and various TTS models, but it does not specify programming language versions (e.g., Python 3.8), library versions (e.g., PyTorch 1.9), or other specific software dependencies with version numbers required to replicate the experiments. |
| Experiment Setup | Yes | For fine-tuning, we keep the conventional settings with training details in Appendix D.1. Moreover, the hyperparameters in Eq. (1) are set to balance the effectiveness and imperceptibility of each component. We determine hyperparameters through experiments evaluating both loss values and component effectiveness, ultimately selecting α = 500 and β = 5 10 3. Additionally, the ϵ in Eq. (1) is 8/255, and we optimize perturbation for 500 iterations. |
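
The split quoted in the Dataset Splits row (80% training, 20% testing, with 10% of the training portion held out for validation when a model needs it) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_dataset`, its parameters, the fixed seed, and the rounding rule are all assumptions made for this example.

```python
import random


def split_dataset(items, test_frac=0.2, val_frac_of_train=0.1,
                  need_val=False, seed=0):
    """Randomly split items into (train, val, test) lists.

    Hypothetical helper: the name, the defaults, and the seed are
    assumptions for illustration, not taken from the paper.
    """
    shuffled = list(items)
    # A seeded generator keeps the random allocation repeatable.
    random.Random(seed).shuffle(shuffled)

    # Hold out the test portion first (20% by default).
    n_test = round(len(shuffled) * test_frac)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # Only carve out a validation set when the model requires one;
    # it is taken from the training portion, not from the full dataset.
    val = []
    if need_val:
        n_val = round(len(train) * val_frac_of_train)
        val, train = train[:n_val], train[n_val:]

    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(100), need_val=True)
    print(len(train), len(val), len(test))
```

Because the validation set comes out of the training portion, a dataset of 100 utterances would yield 72 training, 8 validation, and 20 test items under these assumptions.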