The SSL Interplay: Augmentations, Inductive Bias, and Generalization

Authors: Vivien Cabannes, Bobak Kiani, Randall Balestriero, Yann LeCun, Alberto Bietti

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our study focuses on the joint-embedding framework and characterizes learned representations for given choices of input distributions, data augmentations, and architecture. To obtain a fine-grained picture, we study linear classes of functions endowed with a reproducing kernel, and analyze a theoretically friendly loss function that models both contrastive and non-contrastive methods. Our work generalizes the discrete data setting of HaoChen et al. (2021) and the finite-dimensional setting of Saunshi et al. (2022), encompassing more expressive nonparametric models, potentially with universal approximation properties, which can capture certain properties of architectures through their limiting kernels (Jacot et al., 2018). (A hedged sketch of one loss from this family appears after the table.)
Researcher Affiliation | Collaboration | Meta AI, New York, NY, USA; MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, USA. Correspondence to: Vivien Cabannes <vivc@meta.com>.
Pseudocode | No | The paper describes methods mathematically and textually but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not provide an unambiguous statement or a direct link indicating that its source code is publicly available.
Open Datasets | Yes | The dataset we consider is the halfmoon dataset, where X = Z + 1_{⟨Z, e_1⟩ > 0} e_2 + U, with Z ∼ U(S_2) and U ∼ N(0, σ²I) for σ = 0.1. Augmentations apply Gaussian noise, ξ = X + V for V ∼ N(0, σ²I) with σ = 0.1. (A data-generation sketch appears after the table.)
Dataset Splits | No | The paper mentions training and testing samples, but does not explicitly define a separate validation split by percentage, count, or a reference to a predefined validation set.
Hardware Specification | No | The paper does not specify any particular hardware used for experiments, such as CPU or GPU models, or cloud computing instance types.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Pretraining uses all 3 augmentations for each sample, with a representation dimension k = 20. The downstream problem is solved with kernel ridge regression using the induced kernel from pretraining, and the ridge parameter is tuned on test samples to avoid dealing with model-selection issues. In our experiments, we fixed λ = 10⁻³ and set the scale σ of the exponential kernel to about one fifth of the problem diameter. We plot the eigenfunctions of T_λ estimated empirically with n_pre = 2000 samples in Figure 13. The classification task aims to learn the four classes described on the left of Figure 12. Class labels include some noise, as indicated by the level lines of the conditional probability of Y as a function of X shown in the middle of Figure 12. A training-set example is shown on the right of this figure with n_down = 100. In the experiments we fix k = 5, which ensures that there is strong correlation in performance between the pretraining and downstream tasks. The downstream task is optimized with a least-squares surrogate: we learn g : X → R⁴ that minimizes the least-squares error E[‖g(X) − e_Y‖²] before decoding it as f(X) = arg max_{i ∈ [4]} g_i(X) to get an estimate of the ideal mapping f : X → Y. We report the downstream generalization error for both the least-squares (surrogate) loss and the 0-1 loss in Figure 14. This error is computed as the average over 100 trials on the pretraining task and 200 trials on the downstream task. (A sketch of this downstream step appears after the table.)
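
For the Research Type row: the paper's analysis centers on a theoretically friendly joint-embedding loss that models both contrastive and non-contrastive methods over a reproducing-kernel function class. The sketch below is not the paper's exact objective; it is a generic VICReg-style member of that family, an invariance term plus a covariance-whitening penalty, applied to a linear representation on top of kernel features. All names and hyperparameters here (rbf_features, anchors, scale, beta) are illustrative assumptions.

```python
import numpy as np

def rbf_features(X, anchors, scale):
    """Exponential-kernel feature map phi(x)_j = exp(-||x - a_j||^2 / (2 scale^2)),
    a simple stand-in for a reproducing-kernel feature space."""
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * scale ** 2))

def joint_embedding_loss(W, Phi_a, Phi_b, beta=1.0):
    """Invariance term E||f(xi) - f(xi')||^2 plus a covariance-whitening penalty
    ||E[f f^T] - I||_F^2 for the linear representation f(xi) = W phi(xi).
    One member of the contrastive/non-contrastive family the paper's loss models,
    not necessarily its exact form."""
    Fa, Fb = Phi_a @ W.T, Phi_b @ W.T            # representations of the two views
    invariance = ((Fa - Fb) ** 2).sum(axis=1).mean()
    C = (Fa.T @ Fa) / Fa.shape[0]                # empirical second moment E[f f^T]
    whitening = ((C - np.eye(W.shape[0])) ** 2).sum()
    return invariance + beta * whitening

# Toy usage: two Gaussian-noise views of the same points, representation dimension k = 20.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
anchors = X[:50]
Phi_a = rbf_features(X + 0.1 * rng.standard_normal(X.shape), anchors, scale=0.5)
Phi_b = rbf_features(X + 0.1 * rng.standard_normal(X.shape), anchors, scale=0.5)
W = rng.standard_normal((20, anchors.shape[0])) / np.sqrt(anchors.shape[0])
print(joint_embedding_loss(W, Phi_a, Phi_b))
```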
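
For the Open Datasets row: a minimal sketch of the halfmoon data generation and Gaussian-noise augmentations described above, assuming Z is drawn uniformly on the unit circle in R² (the reading of U(S_2) taken here); the helper names and seeds are placeholders.

```python
import numpy as np

def sample_halfmoon(n, sigma=0.1, rng=None):
    """X = Z + 1_{<Z, e1> > 0} e2 + U, with Z uniform on the unit circle
    and U ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(rng)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    Z = np.stack([np.cos(theta), np.sin(theta)], axis=1)        # Z ~ U(S_2)
    shift = (Z[:, 0] > 0).astype(float)[:, None] * np.array([0.0, 1.0])
    U = sigma * rng.standard_normal((n, 2))
    return Z + shift + U

def augment(X, sigma=0.1, n_aug=3, rng=None):
    """Gaussian-noise augmentations xi = X + V with V ~ N(0, sigma^2 I);
    returns an array of shape (n_aug, n, 2)."""
    rng = np.random.default_rng(rng)
    return X[None] + sigma * rng.standard_normal((n_aug,) + X.shape)

X_pre = sample_halfmoon(2000, rng=0)     # n_pre = 2000 pretraining samples
XI = augment(X_pre, n_aug=3, rng=1)      # 3 augmentations per sample
```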
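
For the Experiment Setup row: a hedged sketch of the downstream step, under the assumption that the kernel induced by pretraining is the linear kernel on the learned representation. Kernel ridge regression is fit on one-hot targets e_Y with λ = 10⁻³, decoded with arg max over the four classes, and scored with both the least-squares surrogate and the 0-1 loss. The function name and the λ·n scaling of the ridge term are illustrative choices, not taken from the paper.

```python
import numpy as np

def downstream_krr(F_train, y_train, F_test, y_test, lam=1e-3, n_classes=4):
    """Kernel ridge regression with the representation-induced kernel K = F F^T,
    one-hot least-squares surrogate, arg-max decoding, and both error metrics."""
    n = F_train.shape[0]
    K = F_train @ F_train.T                       # induced kernel on training points
    Y = np.eye(n_classes)[y_train]                # one-hot targets e_Y
    alpha = np.linalg.solve(K + lam * n * np.eye(n), Y)
    G_test = (F_test @ F_train.T) @ alpha         # g(X) in R^4 on test points
    surrogate = ((G_test - np.eye(n_classes)[y_test]) ** 2).sum(axis=1).mean()
    zero_one = (G_test.argmax(axis=1) != y_test).mean()
    return surrogate, zero_one
```

Given representations of the n_down = 100 training points and of held-out test points, plus integer class labels, this returns the two test errors that the paper averages over 100 pretraining trials and 200 downstream trials.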