Implicit variance regularization in non-contrastive SSL

Authors: Manu Srinath Halvagal, Axel Laborieux, Friedemann Zenke

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, IsoLoss speeds up the initial learning dynamics and increases robustness, thereby allowing us to dispense with the exponential moving average (EMA) target network typically used with non-contrastive methods. Our analysis sheds light on the variance regularization mechanisms of non-contrastive SSL and lays the theoretical grounds for crafting novel loss functions that shape the learning dynamics of the predictor's spectrum.
Researcher Affiliation | Academia | Manu Srinath Halvagal (1,2), Axel Laborieux (1), Friedemann Zenke (1,2); {firstname.lastname}@fmi.ch; (1) Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland; (2) Faculty of Science, University of Basel, Basel, Switzerland
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/fmi-basel/implicit-var-reg
Open Datasets | Yes | We performed several self-supervised learning experiments on CIFAR-10, CIFAR-100 [29], STL-10 [30], and Tiny ImageNet [31].
Dataset Splits | Yes | We recorded the online readout accuracy of a linear classifier trained on frozen features following standard practice, evaluated either on the held-out validation or test set where available. [...] We reported the held-out classification accuracy on the test sets for CIFAR-10/100 and STL-10, and the validation set for Tiny ImageNet, after online training of the gradient-isolated linear classifier on each labeled example in the training set during pretraining. (A linear-readout sketch follows the table.)
Hardware Specification | Yes | All simulations were run on an in-house cluster consisting of 5 nodes with 4 V100 NVIDIA GPUs each, one node with 4 A100 NVIDIA GPUs, and one node with 8 A40 NVIDIA GPUs.
Software Dependencies | No | The paper mentions using the solo-learn library [32] but does not provide specific version numbers for it or any other software dependencies.
Experiment Setup | Yes | We used a projection dimension of 256 for the projection MLP using one hidden layer with 4096 units, and the same architecture for the nonlinear predictor for the BYOL baseline. For networks using EMA target networks, we used the LARS optimizer with learning rate 1.0, whereas for networks without the EMA, we used stochastic gradient descent with momentum 0.9 and learning rate 0.1. Furthermore, we used a warmup period of 10 epochs for the learning rate followed by a cosine decay schedule and a batch size of 256. We also used a weight decay of 4 × 10⁻⁴ for the closed-form predictor models and 10⁻⁵ for the nonlinear predictor models. For the EMA, we started with τ_base = 0.99 and increased τ_EMA to 1 with a cosine schedule, exactly following the configuration reported in [9]. For DirectPred, we used α = 0.5 and τ = 0.998 for the moving average estimate of the correlation matrix, updated at every step. (Configuration sketches follow the table.)
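
The "Dataset Splits" row describes the standard online linear-probe protocol: a linear classifier is trained on detached (gradient-isolated) backbone features during pretraining, and held-out accuracy is then reported on the test or validation split. Below is a minimal PyTorch sketch of that protocol; the class name OnlineLinearProbe, the feature dimension, and the probe learning rate are illustrative assumptions and are not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OnlineLinearProbe(nn.Module):
    # Linear classifier trained online on detached backbone features.
    # The detach() call keeps the probe gradient-isolated, so its loss
    # never influences the self-supervised backbone.
    def __init__(self, feat_dim: int, num_classes: int, lr: float = 0.1):
        super().__init__()
        self.linear = nn.Linear(feat_dim, num_classes)
        self.opt = torch.optim.SGD(self.linear.parameters(), lr=lr, momentum=0.9)

    def step(self, features: torch.Tensor, labels: torch.Tensor) -> float:
        logits = self.linear(features.detach())  # stop gradients into the backbone
        loss = F.cross_entropy(logits, labels)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return (logits.argmax(dim=1) == labels).float().mean().item()  # online accuracy

During pretraining, probe.step(backbone(x), y) would be called on every labeled mini-batch; the reported numbers are then the held-out accuracies on the test sets (CIFAR-10/100, STL-10) or the validation set (Tiny ImageNet).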
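
The "Experiment Setup" row specifies a projector (and, for the BYOL baseline, a predictor) MLP with one hidden layer of 4096 units and a 256-dimensional output, plus the optimizer and schedule settings. The sketch below, assuming a PyTorch setup, wires these numbers together; the BatchNorm/ReLU layout of the MLP, the ResNet-18 backbone feature dimension of 512, and the total number of pretraining epochs are assumptions. The LARS optimizer reported for the EMA-target models is not shown because it is not part of core PyTorch (the solo-learn library cited by the paper provides an implementation).

import math
import torch
import torch.nn as nn

def mlp(in_dim: int, hidden_dim: int = 4096, out_dim: int = 256) -> nn.Sequential:
    # One hidden layer with 4096 units and a 256-dimensional output (per the setup row);
    # the BatchNorm/ReLU composition is a common choice, not taken from the paper.
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

projector = mlp(in_dim=512)   # e.g. ResNet-18 feature dimension (assumption)
predictor = mlp(in_dim=256)   # nonlinear predictor for the BYOL baseline

# Without an EMA target: SGD, momentum 0.9, lr 0.1; weight decay 1e-5 for
# nonlinear-predictor models (4e-4 is reported for closed-form predictor models).
# Backbone parameters would be included in the same parameter list.
opt = torch.optim.SGD(
    list(projector.parameters()) + list(predictor.parameters()),
    lr=0.1, momentum=0.9, weight_decay=1e-5,
)

# 10-epoch warmup followed by cosine decay of the learning rate (batch size 256).
max_epochs = 1000  # placeholder for the total pretraining length (assumption)

def lr_factor(epoch: int, warmup: int = 10) -> float:
    if epoch < warmup:
        return (epoch + 1) / warmup
    t = (epoch - warmup) / max(1, max_epochs - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * t))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_factor)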
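
The same row also fixes the EMA target momentum schedule (τ_base = 0.99 increased to 1 with a cosine schedule) and the DirectPred settings (α = 0.5, moving-average correlation matrix with τ = 0.998, updated every step). Below is a minimal sketch of those two running averages, again assuming PyTorch; the function names are illustrative, and the eigendecomposition-based matrix power for the DirectPred predictor follows the standard DirectPred recipe, with any eigenvalue regularization omitted.

import math
import torch

def ema_tau(step: int, max_steps: int, tau_base: float = 0.99) -> float:
    # Cosine increase of the EMA momentum from tau_base towards 1.
    return 1.0 - (1.0 - tau_base) * 0.5 * (1.0 + math.cos(math.pi * step / max_steps))

@torch.no_grad()
def ema_update(target: torch.nn.Module, online: torch.nn.Module, tau: float) -> None:
    # target <- tau * target + (1 - tau) * online, applied parameter-wise.
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1.0 - tau)

@torch.no_grad()
def directpred_predictor(z: torch.Tensor, corr_ema: torch.Tensor,
                         tau: float = 0.998, alpha: float = 0.5) -> torch.Tensor:
    # Per-step moving-average estimate of the embedding correlation matrix;
    # the linear predictor is then set to its alpha-th matrix power.
    corr = z.T @ z / z.shape[0]
    corr_ema.mul_(tau).add_(corr, alpha=1.0 - tau)
    eigvals, eigvecs = torch.linalg.eigh(corr_ema)
    return eigvecs @ torch.diag(eigvals.clamp(min=0) ** alpha) @ eigvecs.T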