Evaluations and Methods for Explanation through Robustness Analysis

Authors: Cheng-Yu Hsieh, Chih-Kuan Yeh, Xuanqing Liu, Pradeep Kumar Ravikumar, Seungyeon Kim, Sanjiv Kumar, Cho-Jui Hsieh

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through experiments across multiple domains and a user study, we validate the usefulness of our evaluation criteria and our derived explanations. (Section 5, Experiments)
Researcher Affiliation | Collaboration | 1 Paul G. Allen School of Computer Science, University of Washington; 2 Machine Learning Department, Carnegie Mellon University; 3 Department of Computer Science, UCLA; 4 Google Research
Pseudocode | No | The paper describes greedy algorithms in Section 4.1 ('Greedy Algorithm to Compute Optimal Explanations') and Section 4.2 ('Greedy by Set Aggregation Score') but does not present them in a structured pseudocode or algorithm block. (A hedged sketch of the greedy loop appears below the table.)
Open Source Code | Yes | Code is available at https://github.com/ChengYuHsieh/explanation_robustness.
Open Datasets | Yes | We perform the experiments on two image datasets, MNIST (LeCun et al., 2010) and ImageNet (Deng et al., 2009), as well as a text classification dataset, Yahoo! Answers (Zhang et al., 2015).
Dataset Splits | Yes | The training and testing splits used in the experiments are the default splits provided by the original datasets.
Hardware Specification | Yes | All the experiments were performed on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz and an NVIDIA GeForce GTX 1080 Ti GPU.
Software Dependencies | No | The paper mentions using the PyTorch library and the official GLUE repository, but does not specify their version numbers.
Experiment Setup | Yes | In the experiments, we set the PGD attack step size to 1.0 and the number of steps to 100. The hyperparameters are chosen such that the PGD attack most efficiently provides the tightest upper bound on the true robustness value. As mentioned in Section 4.2, we solve Eqn. 6 by subsampling from all possible subsets of Str. Specifically, we compute the coefficients w with respect to 5000 sampled subsets when learning the regression. For all quantitative results, we report the average over 100 random examples. Following common setup (Sundararajan et al., 2017; Ancona et al., 2018), we use zero as the reference value for all explanations that require a baseline. (Hedged sketches of the masked PGD step and the subset-sampling regression appear below the table.)