Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Visual Representation Learning Does Not Generalize Strongly Within the Same Domain
Authors: Lukas Schott, Julius Von Kügelgen, Frederik Träuble, Peter Vincent Gehler, Chris Russell, Matthias Bethge, Bernhard Schölkopf, Francesco Locatello, Wieland Brendel
ICLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In total, we train and test 2000+ models and observe that all of them struggle to learn the underlying mechanism regardless of supervision signal and architectural bias. |
| Researcher Affiliation | Collaboration | ¹University of Tübingen, ²Max Planck Institute for Intelligent Systems, Tübingen, ³University of Cambridge, ⁴Amazon Web Services |
| Pseudocode | No | The paper describes the experimental setup and training details in textual paragraphs (e.g., Section 4, 5, and H.3), but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | To this end, all data sets and evaluation scripts are released alongside a leaderboard on GitHub. ¹https://github.com/bethgelab/InDomainGeneralizationBenchmark |
| Open Datasets | Yes | We consider datasets with images generated from a set of discrete Factors of Variation (FoVs)... dSprites (Matthey et al., 2017), ... Shapes3D (Kim & Mnih, 2018), ... MPI3D (Gondal et al., 2019)... |
| Dataset Splits | No | We further control all considered splits and datasets such that 30% of the available data is in the training set Dtr and the remaining 70% belong to the test set Dte. The paper explicitly defines train and test splits but does not specify a separate, dedicated validation set percentage or size for reproduction. |
| Hardware Specification | Yes | All models are run on the NVIDIA T4 Tensor Core GPUs on the AWS g4dn.4xlarge instances with an approximate total compute of 20 000 GPUh. |
| Software Dependencies | Yes | All models are implemented using PyTorch 1.7. |
| Experiment Setup | Yes | All fully supervised models are trained with the same training scheme. We use the Adam optimizer with a learning rate of 0.0005. ... We train each model with three random seeds for 500,000 iterations with a batch size of b = 64. As a loss function, we consider the mean squared error MSE... |
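
The values quoted in the Dataset Splits and Experiment Setup rows can be collected into a small configuration, which may help when attempting a reproduction. The following is a minimal sketch, not the authors' code: the class `SupervisedSetup`, the helper `split_indices`, the seed values `(0, 1, 2)`, and the uniformly random split are all assumptions made for illustration. The quoted text fixes only the 30/70 proportion, the number of seeds, and the listed hyperparameters.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SupervisedSetup:
    """Hyperparameters as quoted in the table (hypothetical container)."""
    optimizer: str = "Adam"
    learning_rate: float = 0.0005
    iterations: int = 500_000
    batch_size: int = 64
    loss: str = "MSE"
    # Assumption: the source reports three seeds but not their values.
    seeds: tuple = (0, 1, 2)
    # From the Dataset Splits row: 30% train, remaining 70% test.
    train_fraction: float = 0.3


def split_indices(n_samples, train_fraction, seed):
    """Shuffle sample indices and cut them into train and test parts.

    Assumption: a uniformly random split; the quoted text only states
    the proportion of data assigned to each set.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


setup = SupervisedSetup()
# Illustrative dataset size of 1000 samples (not from the source).
train_idx, test_idx = split_indices(1000, setup.train_fraction, setup.seeds[0])
```

Seeding the shuffle keeps the split identical across runs, so each of the three training seeds can be paired with a known partition of the data.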