VoiceBlock: Privacy through Real-Time Adversarial Attacks with Audio-to-Audio Models
Authors: Patrick O'Reilly, Andreas Bugler, Keshav Bhandari, Max Morrison, Bryan Pardo
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our model is highly effective at de-identifying user speech from speaker recognition and able to transfer to an unseen recognition system. We conduct a perceptual study and find that our method produces perturbations significantly less perceptible than baseline anonymization methods, when controlling for effectiveness. We describe experiments used to validate the claimed advances of our work. |
| Researcher Affiliation | Academia | Patrick O'Reilly, Andreas Bugler, Keshav Bhandari, Max Morrison, Bryan Pardo; Department of Computer Science, Northwestern University. {patrick.oreilly2024, andreas, keshavbhandari2023}@u.northwestern.edu, pardo@northwestern.edu |
| Pseudocode | No | The paper describes the VoiceBlock architecture and its modules, but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Audio examples and code can be found at https://interactiveaudiolab.github.io/project/voiceblock.html. |
| Open Datasets | Yes | LibriSpeech (CC-BY 4.0): We use both the train-clean-100 and test-clean subsets of the LibriSpeech dataset [42] for training VoiceBlock. VoxCeleb1 (CC-BY 4.0): To simulate large-scale surveilling speaker recognition, we evaluate attacks on the VoxCeleb1 dataset [39] |
| Dataset Splits | Yes | We use both the train-clean-100 and test-clean subsets of the LibriSpeech dataset [42] for training VoiceBlock. The former comprises 28,539 utterances from 251 speakers while the latter comprises 2,620 utterances from 40 speakers. During training, we select fifteen utterances (one minute total) from each source speaker in the training set and compute embeddings... For our VoiceBlock attack... we select a further ten utterances (40s total) from each source speaker in the training set... Similar to training, during evaluation we select fifteen utterances per speaker as a query set. We again select a further ten utterances to serve as conditioning for the VoiceBlock attack. Finally, twenty utterances of each speaker are enrolled in the speaker recognition system. (A sketch of this per-speaker split construction follows the table.) |
| Hardware Specification | Yes | train for 10 epochs on 3 NVIDIA RTX 2080 Ti GPUs; We measure performance on a single thread on two different CPUs, an Intel i7-5600U @ 3.2 GHz and an Apple M1 Chip. |
| Software Dependencies | Yes | Our VoiceBlock model is implemented in PyTorch [2] and contains 6.3m trainable parameters... using the python-pesq implementation [58] (MIT license); and Short-Time Objective Intelligibility (STOI) [54], using the pystoi implementation [43] (MIT license). For both stages, we use the pyworld [22] implementation of the WORLD vocoder [37] (MIT license). (A sketch exercising these evaluation libraries follows the table.) |
| Experiment Setup | Yes | To enable streaming, we use unidirectional LSTM layers and pass concatenated outputs through a small lookahead convolutional network [57] to incorporate information from future frames at the expense of a small fixed delay. We find a lookahead of 5 frames (48 ms) is sufficient to craft strong de-identification attacks. To encourage strong de-identification attacks, we take the distance corresponding to the 25th percentile of this distribution as the target threshold P25 for our training loss (see Section 3.4). We train each attack as discussed above. (Sketches of the lookahead module and the P25 threshold computation follow the table.) |
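The per-speaker split construction quoted under Dataset Splits can be summarized in a few lines. This is a minimal sketch rather than the authors' code: the utterance index below is a synthetic stand-in, whereas the paper builds it from LibriSpeech train-clean-100 (training) and test-clean (evaluation).

```python
import random

# Hypothetical index: speaker ID -> list of utterance paths. In the paper this
# would be built from LibriSpeech train-clean-100 (training) or test-clean (eval).
speaker_utterances = {
    f"spk{i}": [f"spk{i}/utt{j}.flac" for j in range(50)] for i in range(4)
}

QUERY, CONDITIONING, ENROLLMENT = 15, 10, 20  # utterance counts reported in the paper

splits = {}
for speaker, utts in speaker_utterances.items():
    chosen = random.sample(utts, QUERY + CONDITIONING + ENROLLMENT)
    splits[speaker] = {
        # ~1 minute total; used to compute speaker embeddings (query set)
        "query": chosen[:QUERY],
        # ~40 s total; serves as conditioning for the VoiceBlock attack
        "conditioning": chosen[QUERY:QUERY + CONDITIONING],
        # enrolled in the speaker recognition system at evaluation time
        "enrollment": chosen[QUERY + CONDITIONING:],
    }
```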
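The perceptual-quality dependencies quoted under Software Dependencies (python-pesq, pystoi, pyworld) can be exercised as below. A minimal sketch assuming time-aligned 16 kHz mono float arrays; the helper names are illustrative, and only the library calls (`pesq`, `stoi`, `pw.wav2world`, `pw.synthesize`) come from the cited packages.

```python
import numpy as np
import pyworld as pw       # WORLD vocoder analysis/synthesis (MIT license)
from pesq import pesq      # python-pesq implementation of PESQ (MIT license)
from pystoi import stoi    # pystoi implementation of STOI (MIT license)

def perceptual_scores(reference: np.ndarray, degraded: np.ndarray, fs: int = 16000):
    """Return (PESQ, STOI) for a clean/perturbed utterance pair.

    PESQ's wideband mode requires 16 kHz audio; both signals should be
    time-aligned mono float arrays of equal length.
    """
    return pesq(fs, reference, degraded, "wb"), stoi(reference, degraded, fs)

def world_roundtrip(audio: np.ndarray, fs: int = 16000) -> np.ndarray:
    """Analyze and resynthesize an utterance with the WORLD vocoder."""
    f0, spectral_env, aperiodicity = pw.wav2world(audio.astype(np.float64), fs)
    return pw.synthesize(f0, spectral_env, aperiodicity, fs)
```

The quote's "both stages" refers to a pipeline described in the paper; the round-trip helper above only shows the analysis/synthesis calls themselves.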
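The streaming constraint quoted under Experiment Setup (unidirectional LSTMs followed by a small lookahead convolution over 5 future frames) can be sketched in PyTorch as follows. Layer count, feature sizes, and the padding scheme are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LookaheadLSTM(nn.Module):
    """Unidirectional LSTM followed by a lookahead convolution.

    The convolution spans the current frame plus `lookahead` future frames,
    so the module incurs a fixed delay of `lookahead` frames (5 frames,
    about 48 ms, in the paper). Hidden sizes here are illustrative.
    """

    def __init__(self, in_dim=80, hidden=256, lookahead=5):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.lookahead = lookahead
        # Kernel covers frames [t, t + lookahead]; right-padding below ensures
        # the output at frame t sees exactly `lookahead` future frames.
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=lookahead + 1)

    def forward(self, x):                # x: (batch, time, in_dim)
        h, _ = self.lstm(x)              # (batch, time, hidden)
        h = h.transpose(1, 2)            # (batch, hidden, time)
        h = nn.functional.pad(h, (0, self.lookahead))  # pad the future side
        return self.conv(h).transpose(1, 2)            # (batch, time, hidden)

x = torch.randn(1, 100, 80)             # dummy batch of 100 frames
print(LookaheadLSTM()(x).shape)         # torch.Size([1, 100, 256])
```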
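Finally, the P25 target threshold is the 25th percentile of a distribution of speaker-embedding distances (the paper's Section 3.4 defines the exact distribution). The sketch below only illustrates the percentile computation, using random stand-in embeddings and an assumed cosine distance.

```python
import numpy as np

# Stand-in embeddings: rows are query utterance embeddings and enrolled speaker
# centroids; real embeddings would come from the speaker recognition model.
rng = np.random.default_rng(0)
query = rng.normal(size=(100, 192))
centroids = rng.normal(size=(40, 192))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between row vectors of a and b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - a @ b.T

# Pool the distances, then take the 25th percentile as the target threshold
# P25 used in the de-identification training loss.
distances = cosine_distance(query, centroids).ravel()
p25 = np.percentile(distances, 25)
print(f"P25 threshold: {p25:.3f}")
```

Using a fixed percentile of the observed distance distribution gives the training loss a concrete de-identification target rather than an unbounded objective.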