Real-Time Neural Voice Camouflage

Authors: Mia Chiquier, Chengzhi Mao, Carl Vondrick

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that predictive attacks are able to largely disrupt the established DeepSpeech (Amodei et al., 2016) recognition system, which was trained on the LibriSpeech dataset (Panayotov et al., 2015). On the standard, large-scale LibriSpeech dataset, our approach causes at least a threefold increase in word error rate over baselines, and at least a sixfold increase in character error rate.
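For context, word error rate (WER) and character error rate (CER) are both Levenshtein edit distances normalized by the reference length. A minimal sketch of how these metrics are computed; the transcripts below are hypothetical and not drawn from the paper:

```python
def edit_distance(ref, hyp):
    """Standard Levenshtein distance between two token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

# Hypothetical transcripts, for illustration only.
ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick crown box jumps over lazy dog"
print(f"WER: {wer(ref, hyp):.2f}  CER: {cer(ref, hyp):.2f}")
```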
Researcher Affiliation | Academia | Mia Chiquier, Chengzhi Mao, Carl Vondrick; Department of Computer Science, Columbia University, New York, NY 10025; {mac2500, cm3797, cv2428}@columbia.edu
Pseudocode | No | The paper describes the method algorithmically but does not include a formal pseudocode or algorithm block.
Open Source Code | Yes | voicecamo.cs.columbia.edu
Open Datasets | Yes | LibriSpeech Dataset: We train on the LibriSpeech clean 100-hour dataset, validate on the LibriSpeech clean development set, and test on the LibriSpeech test set. For our approach, we restrict the amplitude of our predicted attack to be 0.008 times the maximum of the absolute value of the amplitude of the speech signal. We call this the relative amplitude throughout the paper. Intuitively, our attack sounds similar to a quiet air conditioner in the background. We additionally evaluate several baselines, including various levels of white noise as well as projected gradient descent. For some of the baselines, we experimented with making the amplitude louder, but never below the amplitude of our predicted attack. In order to measure the time taken fairly, we measured the time necessary to create the attack vector for an input of two seconds, averaged over 200 runs.
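A minimal sketch of the relative-amplitude constraint and the timing protocol quoted above; `constrain_attack` and the random tensors are illustrative stand-ins, not the paper's code:

```python
import time
import torch

def constrain_attack(attack: torch.Tensor, speech: torch.Tensor,
                     relative_amplitude: float = 0.008) -> torch.Tensor:
    """Clip the attack waveform so its peak magnitude never exceeds
    `relative_amplitude` times the peak magnitude of the speech signal."""
    bound = relative_amplitude * speech.abs().max().item()
    return attack.clamp(min=-bound, max=bound)

# Timing protocol from the paper: average the time needed to produce the
# attack for a 2-second input over 200 runs. The random tensors below are
# stand-ins for real speech and for the predictive network's output.
times = []
for _ in range(200):
    speech = torch.randn(2 * 16000)                        # 2 s of 16 kHz audio
    start = time.perf_counter()
    attack = constrain_attack(torch.randn(8000), speech)   # 0.5 s attack
    times.append(time.perf_counter() - start)
print(f"mean attack-generation time: {1e3 * sum(times) / len(times):.3f} ms")
```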
Dataset Splits | Yes | We train on the LibriSpeech clean 100-hour dataset, validate on the LibriSpeech clean development set, and test on the LibriSpeech test set.
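These correspond to the standard LibriSpeech subsets. A sketch of loading them with torchaudio; this is not the paper's own data pipeline, and test-clean is assumed for the test split:

```python
import torchaudio

# Standard LibriSpeech subsets matching the quoted splits:
# train-clean-100 for training, dev-clean for validation, test-clean for test.
root = "./data"  # hypothetical download location
train_set = torchaudio.datasets.LIBRISPEECH(root, url="train-clean-100", download=True)
dev_set = torchaudio.datasets.LIBRISPEECH(root, url="dev-clean", download=True)
test_set = torchaudio.datasets.LIBRISPEECH(root, url="test-clean", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = train_set[0]
print(sample_rate, waveform.shape, transcript[:50])
```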
Hardware Specification | Yes | We optimized our predictive network gθ for 4 epochs with batch size 32 across 8 NVIDIA RTX 2080 Ti GPUs on the 100-hour LibriSpeech dataset. This computation took approximately 2 days.
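Since the code uses PyTorch-Lightning (see the next entry), the quoted budget maps roughly onto a Trainer configuration like the sketch below; the model, data loaders, and distributed strategy are assumptions, and argument names vary across Lightning versions:

```python
import pytorch_lightning as pl

# Training budget quoted above: 4 epochs on 8 GPUs. The batch size of 32 is
# set on the DataLoader rather than on the Trainer. "ddp" is an assumed
# multi-GPU strategy; the paper does not say which one was used.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,          # 8x NVIDIA RTX 2080 Ti in the paper
    strategy="ddp",
    max_epochs=4,
)
# trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)
```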
Software Dependencies | No | Our code was written in PyTorch (Paszke et al., 2019) and PyTorch-Lightning (Falcon et al., 2019). The paper names the software used but does not provide specific version numbers, which are required for reproducibility.
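Without pinned versions, a reproduction would have to record its own environment; a trivial sketch of logging the library versions actually installed:

```python
import torch
import pytorch_lightning

# Record the exact versions used, since the paper does not pin them.
print("torch:", torch.__version__)
print("pytorch_lightning:", pytorch_lightning.__version__)
```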
Experiment Setup | Yes | The input to our network gθ is the Short-Time Fourier Transform (STFT) of the last 2 seconds of the speech signal. The network outputs a waveform of 0.5 seconds, sampled at 16 kHz. To calculate the STFT, we use a Hamming window length of 320 samples, a hop length of 160 samples, and an FFT size of 320, resulting in an input dimension of 2 × 161 × 204. We use a 13-layer convolutional network. The appendix has full network details. [...] We optimized our predictive network gθ for 4 epochs with batch size 32 across 8 NVIDIA RTX 2080 Ti GPUs on the 100-hour LibriSpeech dataset. This computation took approximately 2 days. The learning rate started at 1.5 × 10^-4 and decreased using an exponential learning rate scheduler, with a learning anneal gamma value of 0.99.
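A sketch of the STFT front-end and optimization schedule implied by these hyperparameters; the tiny network below is only a placeholder for the 13-layer convolutional architecture, which is specified in the paper's appendix, and the choice of Adam and the scheduler stepping cadence are assumptions:

```python
import torch

SAMPLE_RATE = 16000
N_FFT = 320          # 161 frequency bins (N_FFT // 2 + 1)
HOP_LENGTH = 160
WIN_LENGTH = 320

def stft_features(speech: torch.Tensor) -> torch.Tensor:
    """Real/imaginary STFT of the last 2 seconds of a 1-D waveform.
    Output shape is (2, 161, T); the paper reports T = 204, and the exact
    frame count depends on the padding convention used."""
    last_two_seconds = speech[-2 * SAMPLE_RATE:]
    spec = torch.stft(
        last_two_seconds,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        win_length=WIN_LENGTH,
        window=torch.hamming_window(WIN_LENGTH),
        return_complex=True,
    )
    return torch.stack([spec.real, spec.imag], dim=0)

# Placeholder for the 13-layer convolutional network g_theta, which maps the
# (2, 161, T) STFT input to a 0.5 s (8000-sample) attack waveform.
g_theta = torch.nn.Sequential(
    torch.nn.Conv2d(2, 32, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(32, 2, kernel_size=3, padding=1),
)

# Quoted schedule: initial learning rate 1.5e-4, decayed by an exponential
# scheduler with gamma = 0.99. The optimizer and how often scheduler.step()
# is called are not stated in the quoted text and are assumptions here.
optimizer = torch.optim.Adam(g_theta.parameters(), lr=1.5e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
```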