Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style
Authors: Julius von Kügelgen, Yash Sharma, Luigi Gresele, Wieland Brendel, Bernhard Schölkopf, Michel Besserve, Francesco Locatello
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform two main experiments. First, we numerically test our main result, Thm. 4.4, in a fully controlled, finite sample setting (§ 5.1), using CL to estimate the entropy term in (5). Second, we seek to better understand the effect of data augmentations used in practice (§ 5.2). |
| Researcher Affiliation | Collaboration | 1 Max Planck Institute for Intelligent Systems Tübingen 2 University of Cambridge 3 Tübingen AI Center, University of Tübingen 4 IMPRS for Intelligent Systems 5 Amazon |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code available at: https://www.github.com/ysharma1126/ssl_identifiability |
| Open Datasets | Yes | We made the Causal3DIdent dataset publicly available at this URL. |
| Dataset Splits | No | The paper does not explicitly state specific training, validation, or test dataset splits (e.g., percentages or sample counts) within the provided text. |
| Hardware Specification | No | The paper mentions that compute and resources are 'Provided in Appendix D', but these details are not present in the provided text. |
| Software Dependencies | No | The paper mentions that software dependencies are 'Provided in Appendix D', but specific version numbers for key software components are not present in the provided text. |
| Experiment Setup | Yes | Experimental setup. We generate synthetic data as described in § 3. We consider $n_c = n_s = 5$, with content and style latents distributed as $c \sim \mathcal{N}(0, \Sigma_c)$ and $s \mid c \sim \mathcal{N}(a + Bc, \Sigma_s)$, thus allowing for statistical dependence within the two blocks (via $\Sigma_c$ and $\Sigma_s$) and causal dependence between content and style (via $B$). For $f$, we use a 3-layer MLP with LeakyReLU activation functions. Experimental setup. For $g$, we train a convolutional encoder composed of a ResNet18 [46] and an additional fully-connected layer, with LeakyReLU activation. As in SimCLR [20], we use InfoNCE with cosine similarity, and train on pairs of augmented examples $(\tilde{x}, \tilde{x}')$. As $n_c$ is unknown and variable depending on the augmentation, we fix $\dim(\hat{c}) = 8$ throughout. |
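
The experiment setup quoted in the last table row can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (their code is linked in the Open Source Code row). The helper names `random_spd`, `sample_content_style`, and `info_nce_cosine` are hypothetical, and the way the covariances, the offset `a`, and the matrix `B` are drawn, as well as the temperature value, are assumptions made only so that the example runs.

```python
import numpy as np


def random_spd(dim, rng):
    """Random symmetric positive-definite matrix (assumed construction)."""
    m = rng.normal(size=(dim, dim))
    return m @ m.T + dim * np.eye(dim)


def sample_content_style(n, n_c=5, n_s=5, seed=0):
    """Sample c ~ N(0, Sigma_c) and s | c ~ N(a + B c, Sigma_s)."""
    rng = np.random.default_rng(seed)
    sigma_c, sigma_s = random_spd(n_c, rng), random_spd(n_s, rng)
    a = rng.normal(size=n_s)         # style offset (assumed random)
    B = rng.normal(size=(n_s, n_c))  # content-to-style dependence (assumed random)
    c = rng.multivariate_normal(np.zeros(n_c), sigma_c, size=n)
    noise = rng.multivariate_normal(np.zeros(n_s), sigma_s, size=n)
    s = a + c @ B.T + noise          # conditional mean a + B c plus Gaussian noise
    return c, s


def info_nce_cosine(z1, z2, temperature=0.1):
    """InfoNCE with cosine similarity; row i of z1 is the positive for row i of z2.

    The temperature value is an assumption, not taken from the paper.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # pairwise cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    c, s = sample_content_style(1000)
    print(c.shape, s.shape)
    rng = np.random.default_rng(1)
    z = rng.normal(size=(64, 8))  # stand-in for 8-dimensional encoder outputs
    print(info_nce_cosine(z, z), info_nce_cosine(z, rng.normal(size=(64, 8))))
```

In this sketch the loss is lower when the two views of each sample produce matching embeddings than when the embeddings are unrelated, which is the signal a contrastive encoder is trained on.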