Proactive Detection of Voice Cloning with Localized Watermarking
Authors: Robin San Roman, Pierre Fernandez, Hady Elsahar, Alexandre Défossez, Teddy Furon, Tuan Tran
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of AudioSeal to detect and localize AI-generated speech. AudioSeal achieves state-of-the-art results on robustness of the detection, far surpassing passive detection with near-perfect detection rates over a wide range of audio edits. It also performs sample-level detection (at a resolution of 1/16k of a second), outperforming WavMark in both speed and performance. |
| Researcher Affiliation | Collaboration | FAIR, Meta; Inria; Kyutai. |
| Pseudocode | No | The paper describes the architecture with diagrams (Figure 4) and text, but it does not provide any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at github.com/facebookresearch/audioseal. (A minimal usage sketch follows this table.) |
| Open Datasets | Yes | Our watermark generator and detector are trained on a 4.5K hours subset from the VoxPopuli (Wang et al., 2021) dataset. |
| Dataset Splits | Yes | Our watermark generator and detector are trained on a 4.5K hours subset from the VoxPopuli (Wang et al., 2021) dataset... Detection is done on 10k ten-second audios from our VoxPopuli validation set. |
| Hardware Specification | Yes | We apply the watermark generator and detector of both models on a dataset of 500 audio segments ranging in length from 1 to 10 seconds, using a single Nvidia Quadro GP100 GPU. |
| Software Dependencies | No | Our loudness function is based on a simplification of the implementation in the torchaudio (Yang et al., 2021) library. Implementation is done with the julius Python library. The paper mentions software libraries but does not provide specific version numbers for these or other core dependencies (e.g., Python, PyTorch). (A loudness-loss sketch follows this table.) |
| Experiment Setup | Yes | We use a sampling rate of 16 kHz and one-second samples, so T = 16000 in our training. A full training requires 600k steps, with Adam, a learning rate of 10⁻⁴, and a batch size of 32. For the drop augmentation, we use k = 5 windows of 0.1 sec. h is set to 32, and the number of additional bits b to 16 (note that h needs to be higher than b; for example, h = 8 is enough in the zero-bit case). The perceptual losses are balanced and weighted as follows: λℓ1 = 0.1, λmsspec = 2.0, λadv = 4.0, λloud = 10.0. The localization and watermarking losses are weighted by λloc = 10.0 and λdec = 1.0 respectively. (A sketch of this loss weighting follows the table.) |
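
The open-source-code row points to the released repository. As a quick illustration, here is a minimal embed-and-detect sketch based on the API advertised in the github.com/facebookresearch/audioseal README (`AudioSeal.load_generator`, `get_watermark`, `load_detector`, `detect_watermark`); the model card names and exact signatures are assumptions that may change between releases.

```python
import torch
from audioseal import AudioSeal  # assumed installable as `pip install audioseal`

# Pretrained 16-bit generator and detector; card names taken from the README.
generator = AudioSeal.load_generator("audioseal_wm_16bits")
detector = AudioSeal.load_detector("audioseal_detector_16bits")

# One second of 16 kHz audio, shape (batch, channels, samples) -- T = 16000.
wav = torch.randn(1, 1, 16000)

# The generator predicts an additive watermark residual.
watermark = generator.get_watermark(wav, 16000)
watermarked = wav + watermark

# Detection returns a detection score and the decoded 16-bit hidden message.
score, message = detector.detect_watermark(watermarked, 16000)
print(score, message)
```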
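
The dependencies row notes that the paper's loudness loss simplifies torchaudio's loudness implementation. As a rough sketch of that idea, the function below penalizes the watermark for being loud relative to the cover signal using `torchaudio.functional.loudness` (ITU-R BS.1770-4); the paper uses a simplified variant, so treat this as an approximation rather than the actual training objective.

```python
import torch
import torchaudio.functional as F


def loudness_loss(wav: torch.Tensor, watermark: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    """Encourage the watermark to be quiet relative to the original audio.

    Inputs have shape (..., channels, time); torchaudio returns loudness in
    LKFS per ITU-R BS.1770-4. The paper's loss simplifies this implementation,
    so this is an illustrative stand-in.
    """
    l_wm = F.loudness(watermark, sr)   # loudness of the watermark residual
    l_sig = F.loudness(wav, sr)        # loudness of the cover signal
    # A lower (more negative) watermark loudness shrinks the loss.
    return (l_wm - l_sig).mean()
```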
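
Finally, to make the loss balancing in the experiment-setup row concrete, the sketch below assembles the weighted objective from the quoted λ values; the individual terms (ℓ1, multi-scale mel-spectrogram, adversarial, loudness, localization, decoding) are placeholders for the paper's actual loss implementations.

```python
import torch

# Weights quoted from the paper's training setup.
LAMBDA = {"l1": 0.1, "msspec": 2.0, "adv": 4.0, "loud": 10.0, "loc": 10.0, "dec": 1.0}


def total_loss(losses: dict) -> torch.Tensor:
    """Weighted sum of the perceptual, localization, and watermarking terms.

    `losses` maps each term name ("l1", "msspec", "adv", "loud", "loc", "dec")
    to a scalar tensor; how each term is computed is defined in the paper,
    so only the weighting is shown here.
    """
    return sum(LAMBDA[name] * value for name, value in losses.items())


# Optimizer settings quoted from the paper: Adam with lr 1e-4, batch size 32,
# 600k steps. `model` is a placeholder for the generator/detector pair.
# opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```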