Proactive Detection of Voice Cloning with Localized Watermarking

Authors: Robin San Roman, Pierre Fernandez, Hady Elsahar, Alexandre Défossez, Teddy Furon, Tuan Tran

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of AudioSeal to detect and localize AI-generated speech. AudioSeal achieves state-of-the-art detection robustness, far surpassing passive detection, with near-perfect detection rates over a wide range of audio edits. It also performs sample-level detection (at a resolution of 1/16k of a second), outperforming WavMark in both speed and performance.
Researcher Affiliation | Collaboration | FAIR (Meta), Inria, Kyutai.
Pseudocode | No | The paper describes the architecture with diagrams (Figure 4) and text, but it does not provide any pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at github.com/facebookresearch/audioseal (see the usage sketch after the table).
Open Datasets | Yes | Our watermark generator and detector are trained on a 4.5K-hour subset of the VoxPopuli (Wang et al., 2021) dataset (a loading sketch follows the table).
Dataset Splits | Yes | Our watermark generator and detector are trained on a 4.5K-hour subset of the VoxPopuli (Wang et al., 2021) dataset... Detection is done on 10k ten-second audios from our VoxPopuli validation set.
Hardware Specification | Yes | We apply the watermark generator and detector of both models on a dataset of 500 audio segments ranging in length from 1 to 10 seconds, using a single Nvidia Quadro GP100 GPU.
Software Dependencies | No | Our loudness function is based on a simplification of the implementation in the torchaudio (Yang et al., 2021) library. Implementation is done with the julius Python library. The paper mentions these libraries but does not provide specific version numbers for them or for other core dependencies (e.g., Python, PyTorch).
Experiment Setup | Yes | We use a sampling rate of 16 kHz and one-second samples, so T = 16000 in our training. A full training requires 600k steps, with Adam, a learning rate of 10^-4, and a batch size of 32. For the drop augmentation, we use k = 5 windows of 0.1 s. h is set to 32, and the number of additional bits b to 16 (note that h needs to be higher than b; for example, h = 8 is enough in the zero-bit case). The perceptual losses are balanced and weighted as follows: λ_ℓ1 = 0.1, λ_msspec = 2.0, λ_adv = 4.0, λ_loud = 10.0. The localization and watermarking losses are weighted by λ_loc = 10.0 and λ_dec = 1.0, respectively. (A training-setup sketch follows the table.)
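
For the open-source code noted above, a minimal watermark-and-detect round trip looks roughly as follows. This sketch follows the audioseal repository's README; model card names and method signatures may change across versions, and the random tensor stands in for real 16 kHz speech.

    import torch
    from audioseal import AudioSeal

    # Pretrained 16-bit watermark generator and detector
    # (card names as documented in the repository README).
    generator = AudioSeal.load_generator("audioseal_wm_16bits")
    detector = AudioSeal.load_detector("audioseal_detector_16bits")

    # Placeholder input: one second of audio at 16 kHz,
    # shaped (batch, channels, samples).
    sr = 16000
    wav = torch.randn(1, 1, sr)

    # The generator predicts an additive watermark signal.
    watermark = generator.get_watermark(wav, sr)
    watermarked = wav + watermark

    # Detection returns a probability that the audio is watermarked
    # and the decoded 16-bit message.
    score, message = detector.detect_watermark(watermarked, sr)
    print(score, message)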
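
VoxPopuli itself is publicly available; one convenient access path is its Hugging Face Hub mirror. This sketch assumes the dataset id facebook/voxpopuli and its "en" configuration; the paper's exact 4.5K-hour training subset and validation split are the authors' own and are not reproduced here.

    from datasets import load_dataset

    # Stream the English portion of VoxPopuli to avoid a full download.
    ds = load_dataset("facebook/voxpopuli", "en", split="train", streaming=True)

    # Each example carries a decoded waveform and its sampling rate.
    sample = next(iter(ds))
    print(sample["audio"]["sampling_rate"], len(sample["audio"]["array"]))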
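
Finally, the reported hyperparameters translate into roughly the following optimization setup. This is a hypothetical sketch rather than the authors' training code: the placeholder network and the losses dictionary only illustrate the Adam configuration and the reported loss weighting.

    import torch

    # Reported loss weights: perceptual terms (l1, msspec, adv, loud),
    # localization (loc), and message decoding (dec).
    lambdas = {"l1": 0.1, "msspec": 2.0, "adv": 4.0,
               "loud": 10.0, "loc": 10.0, "dec": 1.0}

    model = torch.nn.Conv1d(1, 1, kernel_size=3, padding=1)  # placeholder network
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # 600k steps, batch size 32

    def total_loss(losses):
        # Weighted sum of the individual objectives;
        # each value is expected to be a scalar tensor.
        return sum(lambdas[name] * value for name, value in losses.items())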