A Simple Framework for Contrastive Learning of Visual Representations

Authors: Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Most of our study for unsupervised pretraining (learning encoder network f without labels) is done using the ImageNet ILSVRC-2012 dataset (Russakovsky et al., 2015). Some additional pretraining experiments on CIFAR-10 (Krizhevsky & Hinton, 2009) can be found in Appendix B.9. We also test the pretrained results on a wide range of datasets for transfer learning. To evaluate the learned representations, we follow the widely used linear evaluation protocol (Zhang et al., 2016; Oord et al., 2018; Bachman et al., 2019), where a linear classifier is trained on top of the frozen base network, and test accuracy is used as a proxy for representation quality." (a sketch of this linear evaluation protocol appears after the table)
Researcher Affiliation | Industry | Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton (Google Research, Brain Team). Correspondence to: Ting Chen <iamtingchen@google.com>.
Pseudocode | Yes | "Algorithm 1: SimCLR's main learning algorithm." (see the NT-Xent sketch after the table)
Open Source Code | Yes | "Code available at https://github.com/google-research/simclr."
Open Datasets | Yes | "Most of our study for unsupervised pretraining (learning encoder network f without labels) is done using the ImageNet ILSVRC-2012 dataset (Russakovsky et al., 2015). Some additional pretraining experiments on CIFAR-10 (Krizhevsky & Hinton, 2009) can be found in Appendix B.9."
Dataset Splits | Yes | "Following Kornblith et al. (2019), we perform hyperparameter tuning for each model-dataset combination and select the best hyperparameters on a validation set."
Hardware Specification | Yes | "We train our model with Cloud TPUs, using 32 to 128 cores depending on the batch size. With 128 TPU v3 cores, it takes 1.5 hours to train our ResNet-50 with a batch size of 4096 for 100 epochs."
Software Dependencies | No | The information is insufficient. The paper mentions the LARS optimizer and the ResNet architecture, but it does not list specific software dependencies with version numbers.
Experiment Setup | Yes | "Default setting. Unless otherwise specified, for data augmentation we use random crop and resize (with random flip), color distortions, and Gaussian blur (for details, see Appendix A). We use ResNet-50 as the base encoder network, and a 2-layer MLP projection head to project the representation to a 128-dimensional latent space. As the loss, we use NT-Xent, optimized using LARS with a learning rate of 4.8 (= 0.3 × BatchSize/256) and weight decay of 10⁻⁶. We train at batch size 4096 for 100 epochs. Furthermore, we use linear warmup for the first 10 epochs, and decay the learning rate with the cosine decay schedule without restarts (Loshchilov & Hutter, 2016)." (see the augmentation and learning-rate sketches after the table)
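
The Algorithm 1 caption cited in the Pseudocode row refers to the paper's training loop, whose core is the NT-Xent loss named in the Experiment Setup row: for each augmented view in a batch of N images, the other view of the same image is the positive and the remaining 2N - 2 projections in the batch act as negatives. The following is a minimal PyTorch sketch of that loss written for this summary; the authors' released code is TensorFlow-based, and the function name and tensor shapes here are illustrative assumptions rather than the official implementation.

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature):
    # z1, z2: [N, d] projection-head outputs for two augmented views of the same N images.
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2N, d], unit-normalized
    sim = (z @ z.t()) / temperature                      # [2N, 2N] scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # a view is never its own negative
    # Row i's positive is the other view of the same image: i + N for i < N, i - N otherwise.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(sim.device)
    return F.cross_entropy(sim, targets)                 # averaged over all 2N anchor views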
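
The default augmentation listed in the Experiment Setup row (random crop and resize with random flip, color distortion, Gaussian blur) can be approximated with torchvision transforms as below. The jitter strengths, grayscale probability, and blur kernel size follow one reading of the paper's Appendix A and should be treated as assumptions, not a verbatim copy of the released pipeline.

from torchvision import transforms

def simclr_augmentation(size=224, s=1.0):
    # s scales color-distortion strength; the blur kernel is roughly 10% of the image side.
    color_jitter = transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)
    return transforms.Compose([
        transforms.RandomResizedCrop(size),
        transforms.RandomHorizontalFlip(),
        transforms.RandomApply([color_jitter], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
        transforms.ToTensor(),
    ])

Each training image is augmented twice with this pipeline to form the positive pair fed to the loss above.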
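
The optimization recipe in the same row (LARS, learning rate 0.3 × BatchSize/256, 10-epoch linear warmup, cosine decay without restarts over 100 epochs) reduces to a per-step learning-rate schedule like the sketch below; the function name and the step bookkeeping are assumptions made here, not taken from the paper.

import math

def simclr_learning_rate(step, total_steps, warmup_steps, batch_size, base_lr=0.3):
    peak_lr = base_lr * batch_size / 256.0               # 4.8 at batch size 4096
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps       # linear warmup, first 10 epochs
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay, no restarts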
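
Finally, the linear evaluation protocol quoted in the Research Type row freezes the pretrained base network f and trains only a linear classifier on its features, using test accuracy as the proxy for representation quality. A minimal sketch, assuming a generic PyTorch encoder and a caller-supplied feature dimension (both placeholders, not names from the paper):

import torch.nn as nn

def linear_evaluation_model(encoder, feature_dim, num_classes=1000):
    for p in encoder.parameters():
        p.requires_grad = False          # freeze the pretrained base network f
    encoder.eval()                       # keep normalization statistics fixed
    # Only the linear head receives gradients during linear evaluation.
    return nn.Sequential(encoder, nn.Linear(feature_dim, num_classes))

Note that the paper evaluates the representation h = f(x) (the ResNet output before the projection head), so the encoder passed here should output h rather than the 128-dimensional z.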