Barlow Twins: Self-Supervised Learning via Redundancy Reduction

Authors: Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with the current state of the art for ImageNet classification with a linear classifier head and for transfer tasks of classification and object detection.
Researcher Affiliation | Collaboration | Facebook AI Research; New York University, NY, USA. Correspondence to: Jure Zbontar <jzb@fb.com>, Li Jing <ljng@fb.com>, Ishan Misra <imisra@fb.com>, Yann LeCun <yann@fb.com>, Stéphane Deny <stephane.deny.pro@gmail.com>.
Pseudocode | Yes | The pseudocode for Barlow Twins is shown as Algorithm 1 (see the PyTorch-style sketch after this table).
Open Source Code | Yes | Code and pre-trained models (in PyTorch) are available at https://github.com/facebookresearch/barlowtwins
Open Datasets | Yes | Our network is pretrained using self-supervised learning on the training set of the ImageNet ILSVRC-2012 dataset (Deng et al., 2009), without labels.
Dataset Splits | Yes | The top-1 and top-5 accuracies obtained on the ImageNet validation set are reported in Table 1.
Hardware Specification | Yes | Training is distributed across 32 V100 GPUs and takes approximately 124 hours.
Software Dependencies | No | The paper mentions 'PyTorch-style pseudocode' and states that 'Code and pre-trained models (in PyTorch) are available', indicating the use of PyTorch. However, it does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | We use the LARS optimizer (You et al., 2017) and train for 1000 epochs with a batch size of 2048. We use a learning rate of 0.2 for the weights and 0.0048 for the biases and batch normalization parameters. We multiply the learning rate by the batch size and divide it by 256. We use a learning rate warm-up period of 10 epochs, after which we reduce the learning rate by a factor of 1000 using a cosine decay schedule (Loshchilov & Hutter, 2016). We ran a search for the trade-off parameter λ of the loss function and found the best results for λ = 5 × 10⁻³. We use a weight decay parameter of 1.5 × 10⁻⁶. (A worked learning-rate sketch also follows this table.)
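For reference, here is a minimal PyTorch sketch of the Barlow Twins objective along the lines of the paper's Algorithm 1. It is an illustration, not the authors' released code: the `off_diagonal` helper, function names, and the default `lambd=5e-3` (the reported best trade-off) are assumptions made for this sketch.

```python
import torch

def off_diagonal(c):
    """Return a flattened view of all off-diagonal elements of a square matrix."""
    n, m = c.shape
    assert n == m
    return c.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()

def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    """Barlow Twins loss for two batches of embeddings, each of shape (N, D)."""
    n, _ = z_a.shape
    # Normalize each embedding dimension along the batch (zero mean, unit std).
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    # Empirical cross-correlation matrix between the two views, shape (D, D).
    c = (z_a.T @ z_b) / n
    # Invariance term: pull the diagonal of c toward 1.
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    # Redundancy-reduction term: push the off-diagonal of c toward 0.
    off_diag = off_diagonal(c).pow(2).sum()
    return on_diag + lambd * off_diag

# Example usage with random embeddings standing in for the two augmented views:
z1, z2 = torch.randn(256, 8192), torch.randn(256, 8192)
loss = barlow_twins_loss(z1, z2)
```

The normalization along the batch dimension plays the role of the batch normalization applied to the embeddings in the paper, and the two terms correspond to the invariance and redundancy-reduction parts of the loss weighted by λ.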
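As a worked example of the learning-rate recipe quoted in the Experiment Setup row: with a batch size of 2048, the linear scaling rule gives peak rates of 0.2 × 2048/256 = 1.6 for the weights and 0.0048 × 2048/256 = 0.0384 for the biases and batch-normalization parameters. The sketch below shows one plausible warm-up-plus-cosine schedule consistent with that description; the exact schedule shape and function names are assumptions, and the LARS optimizer itself is not shown.

```python
import math

def scaled_lr(base_lr, batch_size=2048):
    # Linear scaling rule from the paper: multiply the base rate by batch_size / 256.
    return base_lr * batch_size / 256

def lr_at_step(step, total_steps, warmup_steps, base_lr, batch_size=2048):
    """Linear warm-up followed by cosine decay down to 1/1000 of the peak rate."""
    peak = scaled_lr(base_lr, batch_size)
    if step < warmup_steps:
        # Linear warm-up over the first 10 epochs' worth of steps.
        return peak * step / warmup_steps
    # Cosine decay from peak to peak/1000 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = peak / 1000
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

print(scaled_lr(0.2))     # 1.6    -> peak rate for the weights
print(scaled_lr(0.0048))  # 0.0384 -> peak rate for biases and batch-norm parameters
```

In practice the weights and the bias/batch-norm parameters would sit in separate optimizer parameter groups, each driven by this schedule with its own base rate.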