Use perturbations when learning from explanations
Authors: Juyeon Heo, Vihari Piratla, Matthew Wicker, Adrian Weller
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate different methods on four datasets: one synthetic and three real-world. The synthetic dataset is similar to decoy-MNIST of Ross et al. (2017) with induced shortcuts and is presented in Section 5.2. For evaluation on practical tasks, we evaluated on a plant phenotyping (Shao et al., 2021) task in Section 5.3, a skin cancer detection (Rieger et al., 2020) task presented in Section 5.4, and an object classification task presented in Section 5.5. |
| Researcher Affiliation | Academia | Juyeon Heo University of Cambridge jh2324@cam.ac.uk Vihari Piratla University of Cambridge vp421@cam.ac.uk Matthew Wicker Alan Turing Institute Adrian Weller Alan Turing Institute and University of Cambridge |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our implementation can be found at: https://github.com/vihari/robust_mlx. |
| Open Datasets | Yes | We evaluate different methods on four datasets: one synthetic and three real-world. The synthetic dataset is similar to decoy-MNIST of Ross et al. (2017) with induced shortcuts and is presented in Section 5.2. For evaluation on practical tasks, we evaluated on a plant phenotyping (Shao et al., 2021) task in Section 5.3, a skin cancer detection (Rieger et al., 2020) task presented in Section 5.4, and an object classification task presented in Section 5.5. All the datasets contain a known spurious feature and were used in the past for evaluation of MLX methods. Figure 2 summarises the three datasets; notice that we additionally require in the training dataset the specification of a mask identifying irrelevant features of the input: the patch for the ISIC dataset, the background for the Plant dataset, the decoy half for Decoy-MNIST images, and the label-specific irrelevant region approved by humans for Salient-Imagenet. |
| Dataset Splits | Yes | We randomly split available labelled data into training, validation, and test sets in the ratio of (0.75, 0.1, 0.15) for ISIC, (0.65, 0.1, 0.25) for Plant (similar to Schramowski et al. (2020)), and (0.6, 0.15, 0.25) for Salient-Imagenet. We use the standard train-test splits on MNIST. |
| Hardware Specification | Yes | Table 3 presents the computation costs, including run time and memory usage, for each method using a GTX 1080 Ti GPU. |
| Software Dependencies | No | The paper mentions using a ResNet-18 model and VGG model architectures, but it does not specify version numbers for any software dependencies, libraries, or programming languages used. |
| Experiment Setup | Yes | We picked the learning rate, optimizer, weight decay, and initialization for best performance with the ERM baseline on validation data; these were not further tuned for other baselines unless stated otherwise. We picked the best λ for Grad-Reg and CDEP from [1, 10, 100, 1000]. Additionally, we also tuned β (weight decay) for Grad-Reg from [1e-4, 1e-2, 1, 10]. For Avg-Ex, perturbations were drawn from zero-mean Gaussian noise with variance σ², where σ was chosen from [0.03, 0.3, 1, 1.5, 2]. In PGD-Ex, the worst perturbation was optimized within an ℓ∞-norm ϵ-ball through seven PGD iterations, where the best ϵ is picked from the range 0.03-5. We did not see much gain when increasing PGD iterations beyond 7; Appendix F contains some results when the number of iterations is varied. In IBP-Ex, we follow the standard procedure of Gowal et al. (2018) to linearly dampen the value of α from 1 to 0.5 and linearly increase the value of ϵ from 0 to ϵ_max, where ϵ_max is picked from 0.01 to 2. We usually just picked the maximum possible value for ϵ_max that converges. For IBP-Ex+Grad-Reg, we have the additional hyperparameter λ (Eqn. 4), which we found to be relatively stable; we set it to 1 for all experiments. |
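
The split ratios quoted in the Dataset Splits row are simple to reproduce from example indices. Below is a minimal sketch, not taken from the authors' repository; the function name `make_splits`, the seed, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def make_splits(n_examples, ratios=(0.75, 0.10, 0.15), seed=0):
    """Randomly partition example indices into train/val/test.

    The default `ratios` mirror the reported ISIC split; swap in
    (0.65, 0.10, 0.25) for Plant or (0.60, 0.15, 0.25) for Salient-Imagenet.
    """
    assert abs(sum(ratios) - 1.0) < 1e-8, "ratios must sum to 1"
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_train = int(ratios[0] * n_examples)
    n_val = int(ratios[1] * n_examples)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

# Example: 2000 labelled images -> roughly 1500 / 200 / 300 indices.
tr, va, te = make_splits(2000)
```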
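The Experiment Setup row describes PGD-Ex as optimizing a worst-case perturbation inside an ℓ∞-norm ϵ-ball over seven PGD iterations, applied to the irrelevant region identified by the mask. The sketch below illustrates that inner maximization under those assumptions; the step-size heuristic, the cross-entropy loss, and the function name `pgd_ex_perturbation` are ours, not the authors' — the reference implementation lives in the linked robust_mlx repository.

```python
import torch
import torch.nn.functional as F

def pgd_ex_perturbation(model, x, y, mask, eps=0.3, n_steps=7):
    """Find a worst-case perturbation of the irrelevant (masked) region.

    Hypothetical sketch: perturb only where `mask` == 1 (patch, background,
    decoy half, ...), stay inside an l_inf eps-ball, and maximise the
    classification loss over `n_steps` PGD iterations (the paper reports 7).
    """
    delta = torch.zeros_like(x, requires_grad=True)
    step_size = 2.5 * eps / n_steps  # common PGD step heuristic, not from the paper
    for _ in range(n_steps):
        loss = F.cross_entropy(model(x + delta * mask), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()   # gradient ascent on the loss
            delta.clamp_(-eps, eps)            # project back into the eps-ball
    return x + delta.detach() * mask
```

Training would then presumably combine the loss on the clean input with the loss on the returned perturbed input, so that the model's prediction cannot be moved by changes confined to the masked, irrelevant features.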
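The same row states that IBP-Ex linearly dampens α from 1 to 0.5 while linearly increasing ϵ from 0 to ϵ_max, following Gowal et al. (2018). A hypothetical scheduler capturing that ramp might look as follows; `warmup_frac` and the per-step granularity are assumptions, not reported values.

```python
def ibp_schedule(step, total_steps, eps_max=0.5, warmup_frac=0.5):
    """Linear warm-up schedule in the style of Gowal et al. (2018).

    Illustrative only: eps grows linearly from 0 to eps_max while the mixing
    weight alpha is dampened linearly from 1.0 to 0.5 over the warm-up phase;
    both stay fixed afterwards.
    """
    t = min(step / (warmup_frac * total_steps), 1.0)
    eps = t * eps_max
    alpha = 1.0 - 0.5 * t
    return eps, alpha
```

In IBP-style training, α of this kind typically weights the nominal loss against the bound-based robust loss (roughly `alpha * nominal_loss + (1 - alpha) * bound_loss`); the exact objective used by IBP-Ex is given in the paper and repository.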