VoiceBlock: Privacy through Real-Time Adversarial Attacks with Audio-to-Audio Models
Authors: Patrick O'Reilly, Andreas Bugler, Keshav Bhandari, Max Morrison, Bryan Pardo
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our model is highly effective at de-identifying user speech from speaker recognition and able to transfer to an unseen recognition system. We conduct a perceptual study and find that our method produces perturbations significantly less perceptible than baseline anonymization methods, when controlling for effectiveness. We describe experiments used to validate the claimed advances of our work. |
| Researcher Affiliation | Academia | Patrick O'Reilly, Andreas Bugler, Keshav Bhandari, Max Morrison, Bryan Pardo; Department of Computer Science, Northwestern University. {patrick.oreilly2024, andreas, keshavbhandari2023}@u.northwestern.edu, pardo@northwestern.edu |
| Pseudocode | No | The paper describes the VoiceBlock architecture and its modules, but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Audio examples and code can be found at https://interactiveaudiolab.github.io/project/voiceblock.html. |
| Open Datasets | Yes | LibriSpeech (CC-BY 4.0): We use both the train-clean-100 and test-clean subsets of the LibriSpeech dataset [42] for training VoiceBlock. VoxCeleb1 (CC-BY 4.0): To simulate large-scale surveilling speaker recognition, we evaluate attacks on the VoxCeleb1 dataset [39] |
| Dataset Splits | Yes | We use both the train-clean-100 and test-clean subsets of the LibriSpeech dataset [42] for training VoiceBlock. The former comprises 28,539 utterances from 251 speakers while the latter comprises 2,620 utterances from 40 speakers. During training, we select fifteen utterances (one minute total) from each source speaker in the training set and compute embeddings... For our VoiceBlock attack... we select a further ten utterances (40s total) from each source speaker in the training set... Similar to training, during evaluation we select fifteen utterances per speaker as a query set. We again select a further ten utterances to serve as conditioning for the VoiceBlock attack. Finally, twenty utterances of each speaker are enrolled in the speaker recognition system. (A sketch of this per-speaker split construction follows the table.) |
| Hardware Specification | Yes | train for 10 epochs on 3 NVIDIA RTX 2080 Ti GPUs; We measure performance on a single thread on two different CPUs, an Intel i7-5600U @ 3.2 GHz and an Apple M1 Chip. |
| Software Dependencies | Yes | Our VoiceBlock model is implemented in PyTorch [2] and contains 6.3m trainable parameters... using the python-pesq implementation [58] (MIT license); and Short-Time Objective Intelligibility (STOI) [54], using the pystoi implementation [43] (MIT license). For both stages, we use the pyworld [22] implementation of the WORLD vocoder [37] (MIT license). (A sketch exercising these evaluation libraries follows the table.) |
| Experiment Setup | Yes | To enable streaming, we use unidirectional LSTM layers and pass concatenated outputs through a small lookahead convolutional network [57] to incorporate information from future frames at the expense of a small fixed delay. We find a lookahead of 5 frames (48 ms) is sufficient to craft strong de-identification attacks. To encourage strong de-identification attacks, we take the distance corresponding to the 25th percentile of this distribution as the target threshold P25 for our training loss (see Section 3.4). We train each attack as discussed above. (Sketches of the lookahead module and the P25 threshold computation follow the table.) |
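The per-speaker split construction quoted under Dataset Splits can be summarized in a few lines. This is a minimal sketch rather than the authors' code: the utterance index below is a synthetic stand-in, whereas the paper builds it from LibriSpeech train-clean-100 (training) and test-clean (evaluation).

```python
import random

# Hypothetical index: speaker ID -> list of utterance paths. In the paper this
# would be built from LibriSpeech train-clean-100 (training) or test-clean (eval).
speaker_utterances = {
    f"spk{i}": [f"spk{i}/utt{j}.flac" for j in range(50)] for i in range(4)
}

QUERY, CONDITIONING, ENROLLMENT = 15, 10, 20  # utterance counts reported in the paper

splits = {}
for speaker, utts in speaker_utterances.items():
    chosen = random.sample(utts, QUERY + CONDITIONING + ENROLLMENT)
    splits[speaker] = {
        # ~1 minute total; used to compute speaker embeddings (query set)
        "query": chosen[:QUERY],
        # ~40 s total; serves as conditioning for the VoiceBlock attack
        "conditioning": chosen[QUERY:QUERY + CONDITIONING],
        # enrolled in the speaker recognition system at evaluation time
        "enrollment": chosen[QUERY + CONDITIONING:],
    }
```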
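The perceptual-quality dependencies quoted under Software Dependencies (python-pesq, pystoi, pyworld) can be exercised as below. A minimal sketch assuming time-aligned 16 kHz mono float arrays; the helper names are illustrative, and only the library calls (`pesq`, `stoi`, `pw.wav2world`, `pw.synthesize`) come from the cited packages.

```python
import numpy as np
import pyworld as pw       # WORLD vocoder analysis/synthesis (MIT license)
from pesq import pesq      # python-pesq implementation of PESQ (MIT license)
from pystoi import stoi    # pystoi implementation of STOI (MIT license)

def perceptual_scores(reference: np.ndarray, degraded: np.ndarray, fs: int = 16000):
    """Return (PESQ, STOI) for a clean/perturbed utterance pair.

    PESQ's wideband mode requires 16 kHz audio; both signals should be
    time-aligned mono float arrays of equal length.
    """
    return pesq(fs, reference, degraded, "wb"), stoi(reference, degraded, fs)

def world_roundtrip(audio: np.ndarray, fs: int = 16000) -> np.ndarray:
    """Analyze and resynthesize an utterance with the WORLD vocoder."""
    f0, spectral_env, aperiodicity = pw.wav2world(audio.astype(np.float64), fs)
    return pw.synthesize(f0, spectral_env, aperiodicity, fs)
```

The quote's "both stages" refers to a pipeline described in the paper; the round-trip helper above only shows the analysis/synthesis calls themselves.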
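The streaming constraint quoted under Experiment Setup (unidirectional LSTMs followed by a small lookahead convolution over 5 future frames) can be sketched in PyTorch as follows. Layer count, feature sizes, and the padding scheme are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LookaheadLSTM(nn.Module):
    """Unidirectional LSTM followed by a lookahead convolution.

    The convolution spans the current frame plus `lookahead` future frames,
    so the module incurs a fixed delay of `lookahead` frames (5 frames,
    about 48 ms, in the paper). Hidden sizes here are illustrative.
    """

    def __init__(self, in_dim=80, hidden=256, lookahead=5):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.lookahead = lookahead
        # Kernel covers frames [t, t + lookahead]; right-padding below ensures
        # the output at frame t sees exactly `lookahead` future frames.
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=lookahead + 1)

    def forward(self, x):                # x: (batch, time, in_dim)
        h, _ = self.lstm(x)              # (batch, time, hidden)
        h = h.transpose(1, 2)            # (batch, hidden, time)
        h = nn.functional.pad(h, (0, self.lookahead))  # pad the future side
        return self.conv(h).transpose(1, 2)            # (batch, time, hidden)

x = torch.randn(1, 100, 80)             # dummy batch of 100 frames
print(LookaheadLSTM()(x).shape)         # torch.Size([1, 100, 256])
```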
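Finally, the P25 target threshold is the 25th percentile of a distribution of speaker-embedding distances (the paper's Section 3.4 defines the exact distribution). The sketch below only illustrates the percentile computation, using random stand-in embeddings and an assumed cosine distance.

```python
import numpy as np

# Stand-in embeddings: rows are query utterance embeddings and enrolled speaker
# centroids; real embeddings would come from the speaker recognition model.
rng = np.random.default_rng(0)
query = rng.normal(size=(100, 192))
centroids = rng.normal(size=(40, 192))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between row vectors of a and b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - a @ b.T

# Pool the distances, then take the 25th percentile as the target threshold
# P25 used in the de-identification training loss.
distances = cosine_distance(query, centroids).ravel()
p25 = np.percentile(distances, 25)
print(f"P25 threshold: {p25:.3f}")
```

Using a fixed percentile of the observed distance distribution gives the training loss a concrete de-identification target rather than an unbounded objective.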