Houdini: Fooling Deep Structured Visual and Speech Recognition Models with Adversarial Examples

Authors: Moustapha M. Cisse, Yossi Adi, Natalia Neverova, Joseph Keshet

NeurIPS 2017

Reproducibility Assessment (Variable / Result / LLM Response)

Variable: Research Type
Result: Experimental
LLM Response: "We successfully apply Houdini to a range of applications such as speech recognition, pose estimation and semantic segmentation. In all cases, the attacks based on Houdini achieve higher success rate than those based on the traditional surrogates used to train the models while using a less perceptible adversarial perturbation."

Variable: Researcher Affiliation
Result: Collaboration
LLM Response: Moustapha Cisse (Facebook AI Research, moustaphacisse@fb.com); Yossi Adi* (Bar-Ilan University, Israel, yossiadidrum@gmail.com); Natalia Neverova* (Facebook AI Research, nneverova@fb.com); Joseph Keshet (Bar-Ilan University, Israel, jkeshet@cs.biu.ac.il)

Variable: Pseudocode
Result: No
LLM Response: The paper provides mathematical equations and descriptions of the proposed loss function and its gradient, but it does not include any structured pseudocode or algorithm blocks.

Variable: Open Source Code
Result: No
LLM Response: The paper does not provide a link to its source code or explicitly state that the code for the methodology is openly available.

Variable: Open Datasets
Result: Yes
LLM Response: "We perform the evaluations on the validation subset of MPII dataset [3] consisting of 3000 images and defined as in [26]. [...] We use a pre-trained Dilation10 model for semantic segmentation [38] and evaluate the success of the attacks on the validation subset of Cityscapes dataset [8]. [...] The model achieves 12% Word Error Rate and 1.5% Character Error Rate on the Librispeech dataset [27], with no additional language modeling."

Variable: Dataset Splits
Result: No
LLM Response: The paper mentions using the "validation subset" of the MPII and Cityscapes datasets and the "clean test set" of Librispeech, but it does not specify the exact percentages or counts for training, validation, and test splits needed to reproduce the data partitioning.

Variable: Hardware Specification
Result: No
LLM Response: The paper does not provide specific details about the hardware (e.g., CPU or GPU models, memory, cloud instances) used for running the experiments.

Variable: Software Dependencies
Result: No
LLM Response: The paper mentions specific models and architectures (e.g., a "Deep Speech-2 based architecture [1]", "Hourglass networks [26]", the "Dilation10 model [38]") and loss functions (e.g., the "Connectionist Temporal Classification (CTC) loss function [13]"), but it does not specify version numbers for these or for other software dependencies such as programming languages or libraries.
Variable: Experiment Setup
Result: Yes
LLM Response: "We perform the optimization iteratively till convergence with an update rule x ← x + ϵ·∇x, where ∇x are the gradients with respect to the input and ϵ = 0.1. [...] We used a perturbation of magnitude ϵ = 0.05 and sampled 20 models for Probit (therefore 20× more computationally expensive than Houdini). [...] The model gets as input raw spectrograms (extracted using a window size of 25 ms, a frame shift of 10 ms, and a Hamming window), and outputs a transcript."
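The iterative update rule quoted above can be sketched as a generic gradient-ascent attack loop. This is a minimal illustration, not the paper's implementation: the function names (`iterative_attack`, `grad_fn`) and the `max_iters`/`tol` convergence criterion are hypothetical, and the sign-of-gradient step is one common convention for FGSM-style attacks; the paper only specifies iterating until convergence with step size ϵ = 0.1.

```python
import numpy as np

def iterative_attack(x, grad_fn, epsilon=0.1, max_iters=100, tol=1e-6):
    """Iteratively perturb input x to increase an attack loss.

    grad_fn(x) is a hypothetical stand-in for backpropagation through
    the target network: it returns the gradient of the attack loss
    (e.g., a Houdini-style surrogate) with respect to the input.
    """
    x_adv = x.copy()
    for _ in range(max_iters):
        grad = grad_fn(x_adv)
        if np.linalg.norm(grad) < tol:
            break  # gradient vanished: treat as converged
        # sign step of magnitude epsilon (assumed FGSM-style convention)
        x_adv = x_adv + epsilon * np.sign(grad)
    return x_adv
```

In a real attack, `grad_fn` would be supplied by automatic differentiation, and the final perturbation `x_adv - x` could additionally be clipped to keep it imperceptible.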
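The spectrogram front end described in the quoted setup (25 ms window, 10 ms frame shift, Hamming window) can be reproduced with basic NumPy framing plus a real FFT. A sketch under stated assumptions: the 16 kHz sample rate is assumed (typical for Librispeech) and the function name is hypothetical; the paper specifies only the window size, frame shift, and window type.

```python
import numpy as np

def raw_spectrogram(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Magnitude spectrogram: Hamming-windowed frames followed by |rFFT|."""
    win = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    # slice overlapping frames and apply the window to each
    frames = np.stack([signal[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    # real FFT per frame gives win // 2 + 1 frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))
```

For a one-second 16 kHz signal this yields 98 frames of 201 frequency bins each; a Deep Speech-2-style model would consume such a time-frequency matrix directly.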