Learning explanations that are hard to vary
Authors: Giambattista Parascandolo, Alexander Neitz, Antonio Orvieto, Luigi Gresele, Bernhard Schölkopf
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate the principle that good explanations are hard to vary in the context of deep learning. We show that averaging gradients across examples, akin to a logical OR (∨) of patterns, can favor memorization and patchwork solutions that sew together different strategies, instead of identifying invariances. To inspect this, we first formalize a notion of consistency for minima of the loss surface, which measures to what extent a minimum appears only when examples are pooled. We then propose and experimentally validate a simple alternative algorithm based on a logical AND (∧), that focuses on invariances and prevents memorization in a set of real-world tasks. Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers. |
| Researcher Affiliation | Academia | ¹MPI for Intelligent Systems, Tübingen; ²ETH Zürich; ³MPI for Biological Cybernetics, Tübingen |
| Pseudocode | Yes | Algorithm 1: Temporal AND-mask Adam (a hedged sketch of the masking step follows the table) |
| Open Source Code | Yes | Our codebase is publicly available at https://github.com/gibipara92/learning-explanations-hard-to-vary. |
| Open Datasets | Yes | Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers. ... To test our hypothesis, we ran an experiment that closely resembles the one in (Zhang et al., 2017) on CIFAR-10. ... We set up a behavioral cloning task based on the game Coin Run (Cobbe et al., 2019b) |
| Dataset Splits | Yes | The training data consists of 1000 states from each of 64 levels, while test data comes from 2000 levels. ... We ran two automatic hyperparameter optimization studies using Tree-structured Parzen Estimation (TPE) (Bergstra et al., 2013) of 1024 trials. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory amounts) are mentioned for the experimental setup. |
| Software Dependencies | No | We used PyTorch (Paszke et al., 2017) to implement all experiments in this paper. Our codebase is publicly available at https://github.com/gibipara92/learning-explanations-hard-to-vary. (No specific PyTorch version or other library versions are mentioned beyond the citation year for PyTorch.) |
| Experiment Setup | Yes | Table 1: Hyperparameter ranges for synthetic data experiments. ... The parameters found to work best from the grid search were: agreement threshold of 1, 256 hidden units, 3 hidden layers, batch size 128, Adam with learning rate 1e-2, no batch norm, no dropout, L2-regularization with a coefficient of 1e-4, no L1-regularization. (See the configuration sketch below.) |
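
The Pseudocode row above refers to the paper's AND-mask, which zeroes out gradient components whose signs disagree across environments before the optimizer step. Below is a minimal PyTorch sketch of that masking step (a hedged illustration, not the authors' implementation; the function name `and_mask_grads` and the flattened-gradient interface are assumptions, see the linked repository for the real code):

```python
import torch

def and_mask_grads(env_grads, agreement_threshold):
    """Combine per-environment gradients with a sign-agreement (AND) mask.

    env_grads: list of flattened gradient tensors, one per environment.
    agreement_threshold: fraction in [0, 1]; a component is kept only if
        the absolute mean of its gradient signs across environments
        reaches this value.
    """
    grads = torch.stack(env_grads)          # [n_envs, n_params]
    signs = torch.sign(grads)               # -1, 0, or +1 per component
    agreement = signs.mean(dim=0).abs()     # 1.0 means all environments agree
    mask = (agreement >= agreement_threshold).to(grads.dtype)
    return mask * grads.mean(dim=0)         # masked average gradient
```

With the agreement threshold of 1 reported in the Experiment Setup row, only components on which every environment's gradient sign agrees survive the mask; the masked gradient is then handed to Adam.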
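
The Experiment Setup row lists the best configuration from the grid search on the synthetic data. The sketch below assembles those reported values into a PyTorch model and optimizer; the input/output dimensions, ReLU activations, and the use of Adam's `weight_decay` for the L2 coefficient are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Reported best hyperparameters: agreement threshold 1, 256 hidden units,
# 3 hidden layers, batch size 128, Adam with learning rate 1e-2,
# L2 coefficient 1e-4, no batch norm, no dropout, no L1 regularization.
HIDDEN_UNITS = 256
HIDDEN_LAYERS = 3
BATCH_SIZE = 128
LEARNING_RATE = 1e-2
L2_COEFF = 1e-4
AGREEMENT_THRESHOLD = 1.0

def build_mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    """Plain MLP with the reported width/depth; activations are assumed ReLU."""
    layers, width = [], in_dim
    for _ in range(HIDDEN_LAYERS):
        layers += [nn.Linear(width, HIDDEN_UNITS), nn.ReLU()]
        width = HIDDEN_UNITS
    layers.append(nn.Linear(width, out_dim))
    return nn.Sequential(*layers)

# in_dim/out_dim are placeholders; the paper's synthetic task defines them.
model = build_mlp(in_dim=2, out_dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE,
                             weight_decay=L2_COEFF)
```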