SemPPL: Predicting Pseudo-Labels for Better Contrastive Representations

Authors: Matko Bošnjak, Pierre Harvey Richemond, Nenad Tomasev, Florian Strub, Jacob C Walker, Felix Hill, Lars Holger Buesing, Razvan Pascanu, Charles Blundell, Jovana Mitrovic

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | SEMPPL outperforms competing semi-supervised methods, setting new state-of-the-art performance of 68.5% and 76% top-1 accuracy when using a ResNet-50 and training on 1% and 10% of labels on ImageNet, respectively. Furthermore, when using selective kernels, SEMPPL significantly outperforms the previous state-of-the-art, achieving 72.3% and 78.3% top-1 accuracy on ImageNet with 1% and 10% labels, respectively, an absolute improvement of +7.8% and +6.2% over previous work.
Researcher Affiliation | Industry | DeepMind {matko, richemond, mitrovic}@deepmind.com
Pseudocode | Yes | Appendix C (Pseudo-code of SemPPL): Listing 1 provides PyTorch-like pseudo-code for SEMPPL, detailing how we compute pseudo-labels and use them to select the additional semantic positives, which are then used in the contrastive loss alongside the augmentation positives. (A hedged sketch of this step appears after the table.)
Open Source Code | Yes | We release the checkpoints and the evaluation code at https://github.com/deepmind/semppl.
Open Datasets | Yes | To evaluate SEMPPL, we pre-train representations using 1% and 10% labelled data from the ImageNet dataset [Russakovsky et al., 2015], based on the splits from Chen et al. [2020a].
Dataset Splits | Yes | We train all the weights (pretrained and classifier weights) using either 1% or 10% of the ImageNet-1k training data, and we use the splits introduced in Chen et al. [2020a] and used by all the methods we compare to: Grill et al. [2020]; Caron et al. [2020]; Dwibedi et al. [2021]; Lee et al. [2021]; Mitrovic et al. [2021]; Tomasev et al. [2022]; Assran et al. [2021]. Models are initially trained on the training sets of the individual datasets, and the validation sets are used to select the best hyperparameters from the executed hyperparameter sweeps. (A hypothetical split-loading sketch appears after the table.)
Hardware Specification | Yes | Our final networks were optimized using tranches of between 128 (for a ResNet-50) and 512 (for the largest ResNets) Cloud TPUv3s, each trained for 300 epochs irrespective of size.
Software Dependencies | No | The paper mentions PyTorch-like pseudo-code and various algorithms/models (e.g., ResNet, SimCLR, RELICv2) and optimizers (LARS), but it does not specify version numbers for any software libraries, frameworks, or dependencies used.
Experiment Setup | Yes | Algorithm parameters: We use a queue of capacity C = 20B, with batch size B = 4096, and temperature τ = 0.2 while randomly sampling negatives from the current batch; we take |N(x)| = 10 negatives in total. For augmentations, we use the standard SimCLR augmentations [Chen et al., 2020a] and the RELICv2 multi-crop and saliency-based masking [Tomasev et al., 2022]; we use 4 large views and 2 small views for augmentation positives, and 3 semantic positives. The semantic positives are computed with a k-NN with k = 1 (see the analysis section in Appendix D); we build a single k-NN instance per augmentation a, queried with all |a| = 4 augmentations. This produces |a|² = 16 k-NN-induced pseudo-labels in total for each unlabelled image, among which we then perform majority voting to compute the final pseudo-label. Optimisation: Our networks are optimized with LARS [You et al., 2017]. Our base learning rate is 0.3 and we train our models for 300 epochs with a learning rate warm-up period of 10 epochs and a cosine decay schedule thereafter. We use a weight decay of 10⁻⁶ and batch size B = 4096. We exclude the biases and batch normalisation parameters from both LARS adaptation and weight decay. The exponential moving average parameter for target networks is 0.996. (Sketches of the majority vote, the learning-rate schedule, and the EMA update appear after the table.)
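
The rows above quote three implementable details: k-NN pseudo-labelling with semantic positives (Pseudocode row), the 1%/10% label splits (Dataset Splits row), and the training hyperparameters (Experiment Setup row). The sketches below are illustrative only; all helper names are ours, not the authors', and the released repository should be treated as the reference implementation.

First, the pseudo-labelling and semantic-positive selection, in the PyTorch-like style of the paper's Listing 1 (a minimal sketch, assuming a first-in-first-out queue of labelled embeddings):

    import torch
    import torch.nn.functional as F

    def predict_pseudo_label(embedding, queue_embeddings, queue_labels, k=1):
        # k-NN over the queue of labelled embeddings via cosine
        # similarity; the paper uses k = 1.
        sims = F.cosine_similarity(embedding.unsqueeze(0), queue_embeddings)
        neighbours = sims.topk(k).indices
        # Majority vote over the k nearest labelled neighbours.
        return queue_labels[neighbours].mode().values

    def sample_semantic_positive(pseudo_label, queue_embeddings, queue_labels):
        # A semantic positive is a queue entry sharing the predicted
        # pseudo-label; it enters the contrastive loss as an extra
        # positive alongside the augmentation positives.
        candidates = (queue_labels == pseudo_label).nonzero(as_tuple=True)[0]
        idx = candidates[torch.randint(len(candidates), (1,))]
        return queue_embeddings[idx]

Second, loading the label splits. This assumes the splits are distributed as plain-text files with one image filename per line, as in the release accompanying Chen et al. [2020a]; the file layout here is an assumption:

    def load_split(split_file):
        # Returns the set of image filenames belonging to the split.
        with open(split_file) as f:
            return {line.strip() for line in f if line.strip()}

    def restrict_to_split(image_paths, split_filenames):
        # Keep only the training images whose filename is in the split.
        return [p for p in image_paths if p.rsplit("/", 1)[-1] in split_filenames]

Third, three pieces of the Experiment Setup row: the per-image majority vote over the |a|² = 16 pseudo-label votes, the warm-up-plus-cosine learning-rate schedule, and the exponential-moving-average target-network update:

    import math
    import torch

    def final_pseudo_label(votes):
        # votes: 1-D integer tensor of the 16 k-NN pseudo-labels
        # produced for one unlabelled image; return the majority vote.
        return votes.mode().values

    def learning_rate(step, total_steps, warmup_steps, base_lr=0.3):
        # Linear warm-up for the first 10 epochs, cosine decay after.
        if step < warmup_steps:
            return base_lr * step / warmup_steps
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

    @torch.no_grad()
    def update_target_network(target_params, online_params, tau=0.996):
        # target <- tau * target + (1 - tau) * online, per parameter.
        for t, o in zip(target_params, online_params):
            t.mul_(tau).add_(o, alpha=1.0 - tau)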