On the Stepwise Nature of Self-Supervised Learning

Authors: James B Simon, Maksis Knutins, Liu Ziyin, Daniel Geisz, Abraham J Fetterman, Joshua Albrecht

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we empirically examine the training of Barlow Twins, SimCLR, and VICReg using ResNets with various initializations and hyperparameters and in all cases clearly observe the stepwise behavior predicted by our analytical model.
Researcher Affiliation | Collaboration | James B. Simon (1,2), Maksis Knutins (2), Liu Ziyin (3), Daniel Geisz (1), Abraham J. Fetterman (2), Joshua Albrecht (2); 1 UC Berkeley, 2 Generally Intelligent, 3 University of Tokyo. Correspondence to: James Simon <james.simon@berkeley.edu>.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code to reproduce results available at https://gitlab.com/generally-intelligent/ssl_dynamics.
Open Datasets | Yes | We sample n = 500 random images from CIFAR-10 (Krizhevsky, 2009) and, for each, take two random crops to size 20 × 20 × 3 to obtain n positive pairs (which thus have feature dimension m = 1200). (A pair-construction sketch appears after the table.)
Dataset Splits | No | The paper does not specify the exact percentages or sample counts for training, validation, and test splits, nor does it refer to a standard predefined split with proper citation.
Hardware Specification | No | The paper mentions running experiments on a 'single GPU' and 'single consumer GPU' but does not specify the exact model (e.g., NVIDIA A100, Tesla V100) or other hardware details such as CPU, memory, or machine type.
Software Dependencies | No | The paper mentions using 'functorch' but does not provide a specific version number for it or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | We train a single-hidden-layer MLP for 7000 epochs over a fixed batch of 50 images from CIFAR-10 using full-batch SGD. Each image is subject to a random 20 × 20 crop and no other augmentations. The learning rate is η = 0.0001 and weights are scaled upon initialization by α = 0.0001. The hidden layer has width 2048 and the network output dimension is d = 10. We use the Barlow Twins loss, but do not apply batch norm to the embeddings when calculating the cross-correlation matrix. λ is set to 1. (A training-loop sketch appears after the table.)
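The Open Datasets row describes building positive pairs from two random 20 × 20 crops of each of n = 500 sampled CIFAR-10 images, giving flattened views of dimension m = 20 · 20 · 3 = 1200. Below is a minimal sketch of that construction, assuming PyTorch and torchvision; the variable names are illustrative and not taken from the authors' released code.

```python
import torch
from torchvision import datasets, transforms

# Two independent random 20x20 crops of the same image form a positive pair.
crop = transforms.Compose([
    transforms.RandomCrop(20),
    transforms.ToTensor(),
])

cifar = datasets.CIFAR10(root="./data", train=True, download=True)

n = 500                               # number of base images / positive pairs
idx = torch.randperm(len(cifar))[:n]  # sample n random images

x1, x2 = [], []
for i in idx.tolist():
    img, _ = cifar[i]                 # PIL image; the label is unused
    x1.append(crop(img).flatten())    # each crop has 20 * 20 * 3 = 1200 features
    x2.append(crop(img).flatten())

x1 = torch.stack(x1)                  # shape (n, m) = (500, 1200), first view
x2 = torch.stack(x2)                  # second view of the same images
```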
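The Experiment Setup row likewise translates into a short training loop. The sketch below follows the stated hyperparameters (hidden width 2048, d = 10, η = α = 1e-4, λ = 1, full-batch SGD for 7000 epochs) and computes the Barlow Twins cross-correlation matrix directly on the embeddings, i.e. without batch norm. Whether crops are re-sampled each epoch is not specified in the quoted text, so fixed views are used here; this is a reconstruction under those assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

d, width, m = 10, 2048, 1200      # output dim, hidden width, input features
lr, alpha, lam = 1e-4, 1e-4, 1.0  # learning rate, init scale, off-diagonal weight

# Single-hidden-layer MLP; all parameters scaled down by alpha at initialization
# (scaling biases too is a simplification of "weights are scaled by alpha").
net = nn.Sequential(nn.Linear(m, width), nn.ReLU(), nn.Linear(width, d))
with torch.no_grad():
    for p in net.parameters():
        p.mul_(alpha)

def barlow_twins_loss(z1, z2, lam=1.0):
    """Barlow Twins loss on raw embeddings (no batch norm before the
    cross-correlation matrix), matching the setup quoted above."""
    n = z1.shape[0]
    c = z1.T @ z2 / n                                       # d x d cross-correlation
    on_diag = (1.0 - torch.diagonal(c)).pow(2).sum()        # pull C_ii toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # penalize C_ij, i != j
    return on_diag + lam * off_diag

# Stand-in views; in practice these are two random-crop views of the fixed
# batch of 50 CIFAR-10 images, built as in the previous sketch.
x1 = torch.randn(50, m)
x2 = torch.randn(50, m)

opt = torch.optim.SGD(net.parameters(), lr=lr)

# Full-batch training over the fixed set of positive pairs.
for epoch in range(7000):
    loss = barlow_twins_loss(net(x1), net(x2), lam)
    opt.zero_grad()
    loss.backward()
    opt.step()
```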