A Consistent and Efficient Evaluation Strategy for Attribution Methods
Authors: Yao Rong, Tobias Leemann, Vadim Borisov, Gjergji Kasneci, Enkelejda Kasneci
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we aim to overcome these shortcomings and make the evaluation more consistent and efficient. To this end, we propose a new debiased strategy that compensates for confounders causing inconsistencies. Furthermore, we show that in the debiased setting, we can skip the retraining without significant changes in the results. This results in drastic efficiency gains as shown in the lower part of Figure 1. We argue that it is crucial for the community to have sound evaluation strategies that do not suffer from limited accessibility due to the required compute capacity. Specifically, we make the following contributions: ... We examine the mechanisms underlying the evaluation strategies based on perturbation by conducting a rigorous information-theoretic analysis, and formally reveal that results can be significantly confounded. To compensate for this confounder, we propose the Noisy Linear Imputation strategy and empirically prove its efficiency and effectiveness. The proposed strategy significantly decreases the sensitivity to hyperparameters such as the removal order. We generalize our findings to a novel evaluation strategy, ROAD (RemOve And Debias), which can be used to objectively and efficiently evaluate several attribution methods. Compared to previous evaluation strategies requiring retraining, e.g., RemOve And Retrain (ROAR) (Hooker et al., 2019), ROAD saves 99% of the computational costs. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Tübingen, Tübingen, Germany. Correspondence to: Yao Rong <yao.rong@uni-tuebingen.de>, Tobias Leemann <tobias.leemann@uni-tuebingen.de>. |
| Pseudocode | No | The paper describes methods in prose and mathematical terms but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our source code at https://github.com/tleemann/road_evaluation. |
| Open Datasets | Yes | To empirically confirm our findings, we performed experiments on CIFAR-10 (Krizhevsky et al., 2009). ... We also use Food-101 (Bossard et al., 2014), a large-scale dataset of high-resolution images, to validate the generalizability of our method. |
| Dataset Splits | No | The paper mentions using CIFAR-10 and Food-101 and achieving a test set accuracy, but does not provide specific train/validation/test dataset splits (percentages, counts, or explicit predefined split references). |
| Hardware Specification | Yes | In our experiments, evaluation using the ROAD took only 0.7% of the resources required for ROAR, as given by the runtimes in Table 4 obtained on the same hardware (single Nvidia GTX 2080Ti and 8 Cores). |
| Software Dependencies | No | The paper mentions using ResNet models and the SGD optimizer but does not specify software dependencies with version numbers (e.g., PyTorch or Python versions). |
| Experiment Setup | Yes | The model is trained with the initial learning rate of 0.01 and the SGD optimizer (Sutskever et al., 2013). We decrease the learning rate by a factor of 0.1 after 25 epochs and train the model for 40 epochs on one GPU. ... The learning rate was reduced by a factor of 0.1 after every 10 epochs. In total, we trained 40 epochs with a batch size of 32. |
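
The Noisy Linear Imputation step quoted above admits a compact reconstruction. The sketch below is a minimal, assumption-laden illustration, not the authors' implementation (which is released at https://github.com/tleemann/road_evaluation): each removed pixel is constrained to equal the mean of its 4-connected neighbors, known neighbors contribute constants to the right-hand side of a sparse linear system, and small Gaussian noise is added to the solution. The function name, the 4-connectivity, the noise scale, and the NumPy/SciPy stack are all choices made here for illustration; the released code may differ, e.g., in its neighborhood weighting.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import factorized


def noisy_linear_imputation(image, mask, noise_std=0.01, rng=None):
    """Fill masked pixels so each equals the mean of its 4-neighbors, plus noise.

    image: (H, W, C) float array with values in [0, 1].
    mask:  (H, W) boolean array, True where a pixel was removed.
    Assumes every masked region borders at least one unmasked pixel;
    otherwise the linear system below is singular.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W, C = image.shape
    flat = image.reshape(-1, C).copy()
    removed = np.flatnonzero(mask.ravel())          # indices of unknown pixels
    if removed.size == 0:
        return image.copy()
    n = removed.size
    row_of = {p: i for i, p in enumerate(removed)}  # pixel index -> equation row

    A = lil_matrix((n, n))
    b = np.zeros((n, C))
    for i, p in enumerate(removed):
        r, c = divmod(p, W)
        neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        valid = [rr * W + cc for rr, cc in neighbors if 0 <= rr < H and 0 <= cc < W]
        A[i, i] = len(valid)            # deg(p) * x_p ...
        for q in valid:
            if q in row_of:
                A[i, row_of[q]] = -1.0  # ... minus the unknown neighbors ...
            else:
                b[i] += flat[q]         # ... equals the sum of known neighbors

    solve = factorized(A.tocsc())       # sparse LU factorization, reused per channel
    x = np.stack([solve(b[:, ch]) for ch in range(C)], axis=1)
    flat[removed] = x + noise_std * rng.standard_normal((n, C))
    return flat.reshape(H, W, C).clip(0.0, 1.0)
```

In the ROAD pipeline, the mask would mark the top-k pixels of an attribution map (MoRF, Most Relevant First) or the bottom-k (LeRF, Least Relevant First), and the imputed image is fed to the unmodified model, with no retraining required.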
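Similarly, the quoted training setup maps onto a few lines of standard PyTorch. The sketch below only encodes what the quotes state (SGD, initial learning rate 0.01, decay by a factor of 0.1, 40 epochs, batch size 32); the ResNet-18 variant, the Nesterov momentum value suggested by the Sutskever et al. citation, and the dummy data are assumptions made here so the snippet runs end to end.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

# Placeholder model and data; swap in real CIFAR-10 or Food-101 loaders.
model = resnet18(num_classes=10)
data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
train_loader = DataLoader(data, batch_size=32, shuffle=True)  # batch size 32 per the quote

# lr=0.01 is from the quote; the momentum/Nesterov settings are assumptions.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# CIFAR-10 schedule per the quote: one 0.1 decay after epoch 25, 40 epochs total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[25], gamma=0.1)
# Food-101 variant per the quote: decay by 0.1 every 10 epochs instead.
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

criterion = torch.nn.CrossEntropyLoss()
for epoch in range(40):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()  # advance the learning-rate schedule once per epoch
```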