Learning explanations that are hard to vary

Authors: Giambattista Parascandolo, Alexander Neitz, Antonio Orvieto, Luigi Gresele, Bernhard Schölkopf

ICLR 2021

Reproducibility checklist — each entry lists the variable, the result, and the supporting LLM response:
Research Type: Experimental
Evidence: "In this paper, we investigate the principle that good explanations are hard to vary in the context of deep learning. We show that averaging gradients across examples, akin to a logical OR (∨) of patterns, can favor memorization and patchwork solutions that sew together different strategies, instead of identifying invariances. To inspect this, we first formalize a notion of consistency for minima of the loss surface, which measures to what extent a minimum appears only when examples are pooled. We then propose and experimentally validate a simple alternative algorithm based on a logical AND (∧), that focuses on invariances and prevents memorization in a set of real-world tasks. Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers."
Researcher Affiliation: Academia
Evidence: "(1) MPI for Intelligent Systems, Tübingen; (2) ETH, Zürich; (3) MPI for Biological Cybernetics, Tübingen"
Pseudocode: Yes
Evidence: "Algorithm 1: Temporal AND-mask Adam"
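The pseudocode entry above refers to the paper's AND-mask, which keeps only those gradient components whose sign agrees across environments before averaging. A minimal NumPy sketch of that idea (the function name `and_mask` and the threshold semantics here are illustrative, not the authors' exact Algorithm 1):

```python
import numpy as np

def and_mask(env_grads, agreement_threshold=1.0):
    """Zero out gradient components whose sign disagrees across environments.

    env_grads: array of shape (n_envs, n_params), one gradient per environment.
    agreement_threshold: minimum |mean sign| per component for it to survive
        (1.0 = strict AND: all environments must agree on the sign).
    Returns the masked average gradient of shape (n_params,).
    """
    signs = np.sign(env_grads)              # (n_envs, n_params), entries in {-1, 0, +1}
    agreement = np.abs(signs.mean(axis=0))  # per-component agreement in [0, 1]
    mask = agreement >= agreement_threshold # True where signs are consistent enough
    return mask * env_grads.mean(axis=0)    # average gradient, disagreeing dims zeroed
```

With `agreement_threshold=1.0` the mask acts as a strict logical AND: a component survives only if every environment pulls it in the same direction, which matches the paper's contrast between averaging (OR-like) and masking (AND-like) of learning signals.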
Open Source Code: Yes
Evidence: "Our codebase is publicly available at https://github.com/gibipara92/learning-explanations-hard-to-vary."
Open Datasets: Yes
Evidence: "Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers." ... "To test our hypothesis, we ran an experiment that closely resembles the one in (Zhang et al., 2017) on CIFAR-10." ... "We set up a behavioral cloning task based on the game CoinRun (Cobbe et al., 2019b)."
Dataset Splits: Yes
Evidence: "The training data consists of 1000 states from each of 64 levels, while test data comes from 2000 levels." ... "We ran two automatic hyperparameter optimization studies using Tree-structured Parzen Estimation (TPE) (Bergstra et al., 2013) of 1024 trials."
Hardware Specification: No
Evidence: No specific hardware details (e.g., GPU models, CPU types, memory amounts) are mentioned for the experimental setup.
Software Dependencies: No
Evidence: "We used PyTorch (Paszke et al., 2017) to implement all experiments in this paper. Our codebase is publicly available at https://github.com/gibipara92/learning-explanations-hard-to-vary." No specific PyTorch version or other library versions are given beyond the citation.
Experiment Setup: Yes
Evidence: "Table 1: Hyperparameter ranges for synthetic data experiments." ... "The parameters found to work best from the grid search were: agreement threshold of 1, 256 hidden units, 3 hidden layers, batch size 128, Adam with learning rate 1e-2, no batch norm, no dropout, L2-regularization with a coefficient of 1e-4, no L1-regularization."