Implicit variance regularization in non-contrastive SSL
Authors: Manu Srinath Halvagal, Axel Laborieux, Friedemann Zenke
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, IsoLoss speeds up the initial learning dynamics and increases robustness, thereby allowing us to dispense with the exponential moving average (EMA) target network typically used with non-contrastive methods. Our analysis sheds light on the variance regularization mechanisms of non-contrastive SSL and lays the theoretical grounds for crafting novel loss functions that shape the learning dynamics of the predictor's spectrum. |
| Researcher Affiliation | Academia | Manu Srinath Halvagal (1,2), Axel Laborieux (1), Friedemann Zenke (1,2); {firstname.lastname}@fmi.ch; (1) Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland; (2) Faculty of Science, University of Basel, Basel, Switzerland |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/fmi-basel/implicit-var-reg |
| Open Datasets | Yes | We performed several self-supervised learning experiments on CIFAR-10, CIFAR-100 [29], STL-10 [30], and Tiny ImageNet [31]. |
| Dataset Splits | Yes | We recorded the online readout accuracy of a linear classifier trained on frozen features following standard practice, evaluated either on the held-out validation or test set where available. [...] We reported the held-out classification accuracy on the test sets for CIFAR-10/100 and STL-10, and the validation set for Tiny ImageNet, after online training of the gradient-isolated linear classifier on each labeled example in the training set during pretraining. |
| Hardware Specification | Yes | All simulations were run on an in-house cluster consisting of 5 nodes with 4 V100 NVIDIA GPUs each, one node with 4 A100 NVIDIA GPUs, and one node with 8 A40 NVIDIA GPUs. |
| Software Dependencies | No | The paper mentions using the Solo-learn library [32] but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | Yes | We used a projection dimension of 256 for the projection MLP using one hidden layer with 4096 units, and the same architecture for the nonlinear predictor for the BYOL baseline. For networks using EMA target networks, we used the LARS optimizer with learning rate 1.0, whereas for networks without the EMA, we used stochastic gradient descent with momentum 0.9 and learning rate 0.1. Furthermore, we used a warmup period of 10 epochs for the learning rate followed by a cosine decay schedule and a batch size of 256. We also used a weight decay of 4 × 10⁻⁴ for the closed-form predictor models and 10⁻⁵ for the nonlinear predictor models. For the EMA, we started with τ_base = 0.99 and increased τ_EMA to 1 with a cosine schedule exactly following the configuration reported in [9]. For DirectPred, we used α = 0.5 and τ = 0.998 for the moving average estimate of the correlation matrix updated at every step. |
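
The Experiment Setup row compresses several scheduling details into one quotation. The sketch below, assuming a PyTorch-style training loop, shows one way the stated settings could be wired together: SGD with momentum 0.9 and learning rate 0.1 for the models trained without an EMA target, a 10-epoch warmup followed by cosine decay, and a BYOL-style cosine ramp of τ_EMA from τ_base = 0.99 toward 1. Function names such as `lr_at_epoch` and `ema_momentum` are illustrative and not taken from the paper's repository, and the LARS variant used for the EMA models (learning rate 1.0) is omitted because LARS is not part of core PyTorch.

```python
import math
import torch

# Hedged sketch of the reported optimization settings (not the authors' code).
# For models trained without an EMA target: SGD, momentum 0.9, lr 0.1;
# weight decay 1e-5 for nonlinear-predictor models (4e-4 for closed-form predictors).
def build_optimizer(model, lr=0.1, momentum=0.9, weight_decay=1e-5):
    return torch.optim.SGD(model.parameters(), lr=lr,
                           momentum=momentum, weight_decay=weight_decay)

def lr_at_epoch(epoch, base_lr=0.1, warmup_epochs=10, total_epochs=1000):
    """10-epoch linear warmup followed by cosine decay, as stated in the table."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

def ema_momentum(step, max_steps, tau_base=0.99):
    """Cosine ramp of the EMA coefficient from tau_base toward 1 (BYOL-style [9])."""
    return 1 - (1 - tau_base) * (math.cos(math.pi * step / max_steps) + 1) / 2

@torch.no_grad()
def update_target(online_net, target_net, tau):
    """Polyak/EMA update of the target-network parameters."""
    for p_o, p_t in zip(online_net.parameters(), target_net.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1 - tau)
```

The cosine ramp follows the schedule popularized by BYOL, which the quoted setup reports matching exactly.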
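Similarly, the Dataset Splits row refers to an online, gradient-isolated linear readout trained during pretraining. The following minimal sketch, again a hedged illustration rather than the authors' implementation, shows the core idea: features are detached so that classifier gradients never reach the encoder, and the running accuracy of this probe on held-out data is what the table calls the online readout accuracy. The helper name `probe_step` is hypothetical.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of an online, gradient-isolated linear probe (illustrative only).
def probe_step(encoder, probe, probe_opt, images, labels):
    with torch.no_grad():                 # freeze features: no gradient to encoder
        feats = encoder(images)
    logits = probe(feats)
    loss = F.cross_entropy(logits, labels)
    probe_opt.zero_grad()
    loss.backward()                       # updates only the linear probe
    probe_opt.step()
    acc = (logits.argmax(dim=1) == labels).float().mean()
    return loss.item(), acc.item()
```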