Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Learning explanations that are hard to vary
Authors: Giambattista Parascandolo, Alexander Neitz, Antonio Orvieto, Luigi Gresele, Bernhard Schölkopf
ICLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate the principle that good explanations are hard to vary in the context of deep learning. We show that averaging gradients across examples akin to a logical OR (_) of patterns can favor memorization and patchwork solutions that sew together different strategies, instead of identifying invariances. To inspect this, we first formalize a notion of consistency for minima of the loss surface, which measures to what extent a minimum appears only when examples are pooled. We then propose and experimentally validate a simple alternative algorithm based on a logical AND ( ), that focuses on invariances and prevents memorization in a set of real-world tasks. Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers. |
| Researcher Affiliation | Academia | 1MPI for Intelligent Systems, Tübingen, 2ETH, Zürich, 3MPI for Biological Cybernetics, Tübingen |
| Pseudocode | Yes | Algorithm 1: Temporal AND-mask Adam |
| Open Source Code | Yes | Our codebase is publicly available at https://github.com/gibipara92/ learning-explanations-hard-to-vary. |
| Open Datasets | Yes | Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers. ... To test our hypothesis, we ran an experiment that closely resembles the one in (Zhang et al., 2017) on CIFAR-10. ... We set up a behavioral cloning task based on the game Coin Run (Cobbe et al., 2019b) |
| Dataset Splits | Yes | The training data consists of 1000 states from each of 64 levels, while test data comes from 2000 levels. ... We ran two automatic hyperparameter optimization studies using Tree-structured Parzen Estimation (TPE) (Bergstra et al., 2013) of 1024 trials. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory amounts) are mentioned for the experimental setup. |
| Software Dependencies | No | We used Pytorch Paszke et al. (2017) to implement all experiments in this paper. Our codebase is publicly available at https://github.com/gibipara92/ learning-explanations-hard-to-vary. (No specific PyTorch version or other library versions are mentioned beyond the citation year for PyTorch). |
| Experiment Setup | Yes | Table 1: Hyperparameter ranges for synthetic data experiments. ... The parameters found to work best from the grid search were: agreement threshold of 1, 256 hidden units, 3 hidden layers, batch size 128, Adam with learning rate 1e-2, no batch norm, no dropout, L2-regularization with a coefficient of 1e-4, no L1-regularization. |