Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks
Authors: Wei Fan, Kejiang Chen, Chang Liu, Weiming Zhang, Nenghai Yu
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. |
| Researcher Affiliation | Academia | 1Anhui Province Key Laboratory of Digital Security, University of Science and Technology of China, Hefei, China. Correspondence to: Kejiang Chen <EMAIL>. |
| Pseudocode | No | The paper describes its methods and processes in prose and mathematical formulations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures displaying structured, code-like steps. |
| Open Source Code | Yes | The code and audio samples are available at https://de-antifake.github.io. |
| Open Datasets | Yes | The evaluation set consists of 25 speakers from the test-clean subset of LibriSpeech (Panayotov et al., 2015), each contributing 5 sentences. [...] The Purification model is based on a pretrained unconditional DiffWave model, which is then fine-tuned on the LibriSpeech (Panayotov et al., 2015) dataset [...] we first add noise randomly selected from the DEMAND dataset (Thiemann et al., 2013) to the train-clean-100 subset of LibriSpeech (Panayotov et al., 2015), [...] We conduct a small-batch inference on the Russian LibriSpeech, https://www.openslr.org/96/ |
| Dataset Splits | Yes | The evaluation set consists of 25 speakers from the test-clean subset of LibriSpeech (Panayotov et al., 2015), each contributing 5 sentences, ranging from short (2-4 seconds) to long (10-15 seconds). These speakers do not overlap with those in the training set of the purification models. We apply each of the six VC methods to the evaluation set, generating a total of 750 (25 × 5 × 6) synthetic speech samples. ... For the training set of the Refinement model, we first add noise randomly selected from the DEMAND dataset (Thiemann et al., 2013) to the train-clean-100 subset of LibriSpeech (Panayotov et al., 2015)... |
| Hardware Specification | Yes | We run different purification methods on the NVIDIA RTX A6000 and obtain the average time spent processing each second of audio for each method, as shown in Table 9. |
| Software Dependencies | No | The paper mentions using a 'pretrained unconditional DiffWave model', 'x-vector-based SV', 'd-vector-based SV', and 'NISQA', along with GitHub links for SpeechBrain and Resemblyzer, but does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | The Purification model is based on a pretrained unconditional DiffWave model, which is then fine-tuned on the LibriSpeech (Panayotov et al., 2015) dataset for 16k steps with a learning rate of 10⁻⁴. ... For reverse diffusion, we set the number of Purification steps as T_pur = 3, and employ the Denoising Diffusion Probabilistic Models (DDPM) sampling method to generate the purified audio. ... The stiffness parameter is fixed at γ = 1.5, the extremal noise levels are set to σ_min = 0.05 and σ_max = 0.5, and the extremal diffusion times are set to T = 1 and τ_ε = 0.03. For reverse diffusion, we use N = 30 time steps and adopt the predictor-corrector scheme (Song et al., 2021), applying one step of annealed Langevin dynamics correction with a step size of r = 0.4. ... using STFT parameters with a window size of 510, a hop length of 128, and a square-root Hann window, all at a sample rate of 16 kHz. |
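The STFT configuration quoted above (window size 510, hop length 128, square-root Hann window, 16 kHz sample rate) can be sketched as follows. This is a minimal illustration using SciPy on placeholder audio, not the authors' implementation; the choice of `scipy.signal.stft` and the random input signal are assumptions.

```python
import numpy as np
from scipy.signal import stft

# Parameters as quoted from the paper's experiment setup
SR = 16000            # sample rate: 16 kHz
WIN_LEN = 510         # STFT window size
HOP = 128             # hop length
window = np.sqrt(np.hanning(WIN_LEN))  # square-root Hann window

# Placeholder: 1 second of random audio standing in for a speech waveform
x = np.random.randn(SR)

# noverlap = window - hop gives the stated hop length of 128 samples
freqs, times, Z = stft(
    x,
    fs=SR,
    window=window,
    nperseg=WIN_LEN,
    noverlap=WIN_LEN - HOP,
    boundary=None,
)

# A 510-sample window yields 510 // 2 + 1 = 256 frequency bins
print(Z.shape[0])  # 256
```

With these settings each STFT frame covers about 32 ms of audio (510 / 16000 s) advanced in 8 ms steps, which is a common trade-off between time and frequency resolution for 16 kHz speech.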