Sanity Simulations for Saliency Methods
Authors: Joon Sik Kim, Gregory Plumb, Ameet Talwalkar
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we design a synthetic benchmarking framework, SMERF, that allows us to perform ground-truth-based evaluation while controlling the complexity of the model's reasoning. Experimentally, SMERF reveals significant limitations in existing saliency methods and, as a result, represents a useful tool for the development of new saliency methods. Using SMERF, we consider seven distinct model reasoning settings with varying complexity, and perform an extensive evaluation of 10 leading saliency methods for each setting. Our analyses are summarized in Figure 2 and discussed at length throughout Section 4. |
| Researcher Affiliation | Academia | Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA. Correspondence to: Joon Sik Kim <joonkim@cmu.edu>. |
| Pseudocode | No | The paper describes procedures and workflows using text and diagrams (e.g., Figure 5), but it does not include any formal pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | To facilitate this process, we provide source code that allows a user to (i) run the entire pipeline from generating datasets to computing results; and (ii) evaluate new tasks by encoding new model reasoning and new methods. https://github.com/wnstlr/SMERF |
| Open Datasets | Yes | SMERF instantiates simple and complex model reasoning across these three settings by creating a family of datasets called Text Box. We further illustrate how SMERF's synthetic evaluations translate to more natural images, by presenting qualitatively similar yet generally worse results on analogous reasoning tasks that leverage natural image backgrounds instead of synthetic ones (Section 4.4). We replace the background of the Text Box datasets with real images of baseball stadiums, chosen to simulate tasks that are more similar in spirit to the one depicted in the top panel of Figure 1, sampled from the Places dataset (Zhou et al., 2017). |
| Dataset Splits | Yes | A convolutional neural network is trained on the entire set of buckets, which is then validated with unseen data points from each bucket to ensure that the ground-truth model reasoning has been properly learned. Table 2. The seven model reasoning settings considered in the experiments (Section 4) and the number of data points used. Total Training: 24000... Total Validation: 6000... |
| Hardware Specification | Yes | SMERF overall does not require much computational load as the model sizes are not big; the entire pipeline was tested out on a machine with a single GPU (GTX 1070, 8GB), with a system memory of 16GB. |
| Software Dependencies | No | The paper mentions software like 'iNNvestigate' and specific implementations like 'Grad-CAM' and 'Deep SHAP', but it does not provide specific version numbers for any of these software components, which are required for reproducibility. |
| Experiment Setup | Yes | For simple reasoning, we have three convolutional layers (32 filters, kernel-size 3, stride (2,2); 64 filters, kernel-size 3, stride (2,2); 64 filters, kernel-size 3, stride (2,2)), followed by two fully-connected layers (200 units; 2 units), all with ReLU activation functions except for the output layer. For complex reasoning, however... four convolutional layers (64 filters... Learning rate was set as 0.0001, trained with Adam optimizer minimizing the binary cross entropy loss, with maximum epoch of 10. (A hedged Keras sketch of this setup follows the table.) |
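
The quoted setup maps directly onto a small Keras model. The sketch below is a non-authoritative reconstruction of the simple-reasoning network: the input shape, padding, output activation, and all names are assumptions for illustration, while the filter counts, kernel sizes, strides, dense widths, Adam learning rate of 0.0001, binary cross-entropy loss, and 10-epoch budget come from the description above.

```python
# Hedged sketch of the "simple reasoning" CNN from the Experiment Setup row.
# Only the layer widths, kernel sizes, strides, optimizer, learning rate,
# loss, and epoch budget come from the paper; everything else is assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_simple_reasoning_cnn(input_shape=(64, 64, 3)):
    """Three conv layers (32/64/64 filters, kernel 3, stride (2,2)) + FC(200) + FC(2)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=3, strides=(2, 2), activation="relu"),
        layers.Conv2D(64, kernel_size=3, strides=(2, 2), activation="relu"),
        layers.Conv2D(64, kernel_size=3, strides=(2, 2), activation="relu"),
        layers.Flatten(),
        layers.Dense(200, activation="relu"),
        # No ReLU on the output layer per the description; a logits output is assumed.
        layers.Dense(2),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model

# Assumed usage with the quoted split sizes (24000 training / 6000 validation points):
# model = build_simple_reasoning_cnn()
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
```

Note that the paper pairs a 2-unit output with binary cross-entropy; in Keras this works when labels are two-column one-hot vectors, which is the assumption made here.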