Sanity Checks for Saliency Metrics

Authors: Richard Tomsett, Dan Harborne, Supriyo Chakraborty, Prudhvi Gurram, Alun Preece

AAAI 2020, pp. 6021-6029

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We performed all our experiments on a CNN model trained on the CIFAR-10 dataset and classification task (Krizhevsky 2009). We chose CIFAR-10 because it is a well-known image classification dataset of suitable complexity (10 non-linearly separable classes) whose size (50,000 32×32 RGB training images, 10,000 test images) is not prohibitive to running many experiments in a reasonable amount of time. We trained a standard CNN containing three sequential blocks, with each block consisting of two 2D convolution layers with batch normalization, followed by a max pooling layer. The resulting model achieved a test set accuracy of 86%. Our experiments were designed to measure the reliability of saliency metrics for saliency maps, as outlined in the previous section. We investigated both AOPC and faithfulness F. (See the model architecture and AOPC sketches after the table.)
Researcher Affiliation | Collaboration | Richard Tomsett (Emerging Technology, IBM Research, Hursley, UK); Dan Harborne (Crime and Security Research Institute, Cardiff University, Cardiff, UK); Supriyo Chakraborty (IBM Research, Yorktown Heights, NY, USA); Prudhvi Gurram (Booz Allen Hamilton and CCDC Army Research Laboratory, Adelphi, MD, USA); Alun Preece (Crime and Security Research Institute, Cardiff University, Cardiff, UK)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper mentions using third-party implementations: "We use the implementations of gradient (which we convert to sensitivity by taking the channel-wise maximum of the magnitude of the gradient), gradient × input, and deep Taylor decomposition provided by the iNNvestigate toolbox (Alber et al. 2019), and the implementation of Deep SHAP available at https://github.com/slundberg/shap/." However, it does not state that the authors' own code for their methodology (e.g., sanity checks, metric calculations) is open source or available. (See the saliency-map sketch after the table.)
Open Datasets | Yes | We performed all our experiments on a CNN model trained on the CIFAR-10 dataset and classification task (Krizhevsky 2009).
Dataset Splits | Yes | The model was trained on 45,000 training samples, with 5,000 held out as a validation set. During training the model was regularized using ℓ2 weight decay and dropout, and early stopping was used to prevent over-fitting. We performed our experiments on the whole of the CIFAR-10 test set of 10,000 images (1,000 per class). (See the training sketch after the table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., CPU or GPU models, memory specifications).
Software Dependencies | No | The paper mentions specific tools and libraries, such as the iNNvestigate toolbox (Alber et al. 2019) and Deep SHAP (https://github.com/slundberg/shap/), but it does not provide version numbers for these or any other software dependencies needed to reproduce the experimental setup.
Experiment Setup | Yes | The model had all bias terms set to zero, as the inclusion of bias terms can pose difficulties for relevance backpropagation approaches to pixel saliency estimation (Wang, Zhou, and Bilmes 2019; Montavon, Samek, and Müller 2018). The model was trained on 45,000 training samples, with 5,000 held out as a validation set. During training the model was regularized using ℓ2 weight decay and dropout, and early stopping was used to prevent over-fitting. (See the sketches below.)
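
The Research Type and Experiment Setup rows describe the model in enough detail to sketch it. Below is a minimal, hedged reconstruction in TensorFlow/Keras: three sequential blocks of two Conv2D + BatchNorm layers followed by max pooling, with biases disabled and L2 weight decay. The kernel size, filter widths, activation, and dropout rate are assumptions not stated in the excerpt, so treat them as illustrative placeholders rather than the authors' configuration.

```python
# Hedged sketch: CIFAR-10 CNN with three sequential blocks, each containing
# two Conv2D + BatchNorm layers followed by max pooling; biases disabled and
# L2 weight decay applied, as described in the report. Filter counts, kernel
# size, activation, and dropout rate are illustrative assumptions.
from tensorflow.keras import layers, models, regularizers

def build_cifar10_cnn(weight_decay=1e-4, dropout_rate=0.3):
    reg = regularizers.l2(weight_decay)
    inputs = layers.Input(shape=(32, 32, 3))
    x = inputs
    for filters in (32, 64, 128):              # three sequential blocks
        for _ in range(2):                     # two conv layers per block
            x = layers.Conv2D(filters, 3, padding="same",
                              use_bias=False,  # all bias terms set to zero
                              kernel_regularizer=reg)(x)
            x = layers.BatchNormalization()(x)
            x = layers.Activation("relu")(x)
        x = layers.MaxPooling2D()(x)
        x = layers.Dropout(dropout_rate)(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(10, activation="softmax",
                           use_bias=False, kernel_regularizer=reg)(x)
    return models.Model(inputs, outputs)
```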
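A companion sketch of the data handling and training loop implied by the Dataset Splits and Experiment Setup rows: a 45,000/5,000 train/validation split of the CIFAR-10 training set, with early stopping against the validation set. It reuses build_cifar10_cnn from the sketch above; the optimizer, batch size, epoch budget, and patience are assumptions.

```python
# Hedged sketch of the training setup: 45,000/5,000 train/validation split,
# early stopping on validation loss. Optimizer, batch size, epochs, and
# patience are illustrative assumptions, not taken from the paper.
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

x_val, y_val = x_train[45000:], y_train[45000:]       # 5,000 held-out validation images
x_train, y_train = x_train[:45000], y_train[:45000]   # 45,000 training images

model = build_cifar10_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100, batch_size=128,
          callbacks=[EarlyStopping(monitor="val_loss", patience=10,
                                   restore_best_weights=True)])
```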
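The Open Source Code row points to third-party saliency implementations. The following sketch shows how those libraries are typically invoked (iNNvestigate's gradient, gradient × input, and deep Taylor analyzers, plus shap's DeepExplainer); it illustrates the libraries' documented usage rather than the authors' released code, and the sample counts are arbitrary.

```python
# Hedged sketch of saliency-map generation with the third-party libraries
# named in the paper: iNNvestigate (gradient, gradient x input, deep Taylor)
# and Deep SHAP. Analyzer names follow the libraries' documented APIs;
# preprocessing details and sample counts are illustrative.
import numpy as np
import innvestigate
import shap

# iNNvestigate analyzers are applied to the model without its softmax layer.
model_wo_softmax = innvestigate.utils.model_wo_softmax(model)
analyzers = {
    "gradient": innvestigate.create_analyzer("gradient", model_wo_softmax),
    "gradient_x_input": innvestigate.create_analyzer("input_t_gradient", model_wo_softmax),
    "deep_taylor": innvestigate.create_analyzer("deep_taylor", model_wo_softmax),
}
saliency = {name: a.analyze(x_test[:16]) for name, a in analyzers.items()}

# "Sensitivity" as described in the paper: channel-wise maximum of the
# gradient magnitude.
sensitivity = np.abs(saliency["gradient"]).max(axis=-1)

# Deep SHAP, with a small background sample drawn from the training set.
explainer = shap.DeepExplainer(model, x_train[:100])
shap_values = explainer.shap_values(x_test[:16])
```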
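Finally, the Research Type row names AOPC as one of the saliency metrics under test. A simplified, hedged sketch of that metric (area over the MoRF perturbation curve, Samek et al. 2017) is given below; the patch size, number of perturbation steps, and uniform-noise perturbation are illustrative choices, not taken from the paper.

```python
# Simplified, hedged sketch of AOPC (area over the MoRF perturbation curve,
# Samek et al. 2017): perturb the most salient patches first and average the
# resulting drop in the model's output for the target class. Patch size,
# step count, and uniform-noise perturbation are illustrative assumptions.
import numpy as np

def aopc(model, image, saliency_map, target_class, n_steps=30, patch=2, seed=0):
    rng = np.random.default_rng(seed)
    h, w = saliency_map.shape
    # Rank non-overlapping patches by total saliency (most relevant first).
    coords = [(i, j) for i in range(0, h, patch) for j in range(0, w, patch)]
    coords.sort(key=lambda p: saliency_map[p[0]:p[0] + patch,
                                           p[1]:p[1] + patch].sum(),
                reverse=True)
    perturbed = image.copy()
    f0 = model.predict(image[None], verbose=0)[0, target_class]
    drops = [0.0]                              # k = 0 term of the AOPC sum
    for i, j in coords[:n_steps]:
        perturbed[i:i + patch, j:j + patch, :] = rng.uniform(
            0.0, 1.0, (patch, patch, image.shape[-1]))
        fk = model.predict(perturbed[None], verbose=0)[0, target_class]
        drops.append(f0 - fk)
    return float(np.mean(drops))               # 1/(L+1) * sum of score drops
```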