Do Input Gradients Highlight Discriminative Features?

Authors: Harshay Shah, Prateek Jain, Praneeth Netrapalli

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 1. We develop an evaluation framework, DiffROAR, to test assumption (A) on four image classification benchmarks. Our results suggest that (i) input gradients of standard models (i.e., trained on original data) may grossly violate (A), whereas (ii) input gradients of adversarially robust models satisfy (A) reasonably well. 2. We then introduce BlockMNIST, an MNIST-based semi-real dataset that by design encodes a priori knowledge of discriminative features. Our analysis on BlockMNIST leverages this information to validate as well as characterize differences between input gradient attributions of standard and robust models. 3. Finally, we theoretically prove that our empirical findings hold on a simplified version of the BlockMNIST dataset. Specifically, we prove that input gradients of standard one-hidden-layer MLPs trained on this dataset do not highlight instance-specific signal coordinates, thus grossly violating (A). (A minimal sketch of the DiffROAR masking-and-retraining idea appears below the table.)
Researcher Affiliation | Industry | Harshay Shah (Microsoft Research India, harshay@google.com); Prateek Jain (Microsoft Research India, prajain@google.com); Praneeth Netrapalli (Microsoft Research India, pnetrapalli@google.com). Part of the work was completed after joining Google Research India.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We believe that the DiffROAR framework and BlockMNIST datasets serve as sanity checks to audit interpretability methods; code and data available at https://github.com/harshays/inputgradients.
Open Datasets | Yes | We consider four benchmark image classification datasets: SVHN [38], FashionMNIST [39], CIFAR-10 [40] and ImageNet-10 [41]. ImageNet-10 is an open-sourced variant (https://github.com/MadryLab/robustness/) of ImageNet [41]... Our code, along with the proposed datasets, is publicly available at https://github.com/harshays/inputgradients. (An illustrative dataset-loading snippet follows the table.)
Dataset Splits | No | The paper mentions 'unmasked train and test datasets' but does not explicitly provide details about a validation set or its split percentages/counts.
Hardware Specification | No | The paper does not explicitly mention any specific hardware used for running the experiments (e.g., GPU models, CPU types, or cloud compute specifications).
Software Dependencies | No | The paper mentions using MLPs, CNNs, and ResNets, along with PGD adversarial training, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). (A generic PGD adversarial-training sketch is included below the table.)
Experiment Setup | Yes | Unless mentioned otherwise, we train models using stochastic gradient descent (SGD), with momentum 0.9, batch size 256, ℓ2 regularization 0.0005 and initial learning rate 0.1 that decays by a factor of 0.75 every 20 epochs. Additionally, we use standard data augmentation and train models for at most 500 epochs, stopping early if cross-entropy loss on training data goes below 0.001. (See the training-setup sketch below the table.)
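
The DiffROAR evaluation referenced in the Research Type row ranks input coordinates by attribution magnitude, keeps only the top-ranked or bottom-ranked fraction of pixels, retrains on the masked data, and compares the predictive power of the two resulting models. The code below is a minimal PyTorch approximation of that idea, not the authors' released implementation: the helper names (`input_gradient_attribution`, `unmask_by_attribution`, `diffroar_score`) and the caller-supplied `train_and_eval` routine are hypothetical.

```python
import torch
import torch.nn.functional as F

def input_gradient_attribution(model, x, y):
    """Attribution = absolute loss gradient w.r.t. the input, summed over channels."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return grad.abs().sum(dim=1)  # shape (N, H, W)

def unmask_by_attribution(x, attributions, fraction, keep_top):
    """Keep only the top (keep_top=True) or bottom (keep_top=False) `fraction`
    of pixels ranked by attribution; zero out everything else."""
    n, _, h, w = x.shape
    k = max(1, int(fraction * h * w))
    flat = attributions.reshape(n, -1)
    idx = flat.topk(k, dim=1, largest=keep_top).indices
    mask = torch.zeros_like(flat).scatter_(1, idx, 1.0).reshape(n, 1, h, w)
    return x * mask

def diffroar_score(train_and_eval, model, x, y, fraction=0.2):
    """DiffROAR-style gap: accuracy after retraining on top-ranked pixels minus
    accuracy after retraining on bottom-ranked pixels. `train_and_eval(images,
    labels) -> test accuracy` retrains a fresh model and is supplied by the caller."""
    attr = input_gradient_attribution(model, x, y)
    acc_top = train_and_eval(unmask_by_attribution(x, attr, fraction, keep_top=True), y)
    acc_bot = train_and_eval(unmask_by_attribution(x, attr, fraction, keep_top=False), y)
    return acc_top - acc_bot
```

A large positive score would indicate that the attribution method ranks genuinely discriminative features near the top, which is the assumption (A) being tested.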
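
Three of the four benchmarks listed in the Open Datasets row are available through standard torchvision loaders; ImageNet-10 is distributed with the MadryLab robustness repository rather than torchvision. The snippet below is an illustrative loading example (the `root` directory is a placeholder), not part of the authors' code.

```python
import torchvision
import torchvision.transforms as T

root = "./data"  # placeholder download directory
to_tensor = T.ToTensor()

# Benchmarks available in torchvision; ImageNet-10 ships with the
# MadryLab robustness repository and is therefore not loaded here.
svhn_train = torchvision.datasets.SVHN(root, split="train", transform=to_tensor, download=True)
fmnist_train = torchvision.datasets.FashionMNIST(root, train=True, transform=to_tensor, download=True)
cifar_train = torchvision.datasets.CIFAR10(root, train=True, transform=to_tensor, download=True)
```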
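
The robust models discussed under Software Dependencies are obtained with PGD adversarial training. The following is a generic ℓ∞ PGD training step in PyTorch, shown only to make the procedure concrete; the perturbation budget, step size, and number of steps are placeholder values, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, step=2 / 255, n_steps=10):
    """Generic l-infinity PGD; eps, step, and n_steps are placeholders.
    Assumes inputs are scaled to [0, 1]."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(n_steps):
        loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + step * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, x, y):
    """One robust-training step: fit the model on PGD-perturbed inputs."""
    model.train()
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```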
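
The optimization settings quoted in the Experiment Setup row translate directly into a standard training loop. The sketch below assumes PyTorch (the paper does not state its framework); `model` and `train_loader` are placeholders, and the reported batch size of 256 would be configured on the DataLoader.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, device="cuda"):
    """Reported setup: SGD with momentum 0.9 and weight decay 5e-4, initial LR 0.1
    decayed by a factor of 0.75 every 20 epochs, at most 500 epochs, and early
    stopping once training cross-entropy drops below 0.001."""
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.75)

    for epoch in range(500):
        total_loss, total_examples = 0.0, 0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * x.size(0)
            total_examples += x.size(0)
        scheduler.step()
        if total_loss / total_examples < 0.001:
            break  # reported early-stopping criterion on training cross-entropy
    return model
```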