SemPPL: Predicting Pseudo-Labels for Better Contrastive Representations
Authors: Matko Bošnjak, Pierre Harvey Richemond, Nenad Tomasev, Florian Strub, Jacob C Walker, Felix Hill, Lars Holger Buesing, Razvan Pascanu, Charles Blundell, Jovana Mitrovic
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | SEMPPL outperforms competing semi-supervised methods, setting new state-of-the-art performance of 68.5% and 76% top-1 accuracy when using a ResNet-50 and training on 1% and 10% of labels on ImageNet, respectively. Furthermore, when using selective kernels, SEMPPL significantly outperforms the previous state of the art, achieving 72.3% and 78.3% top-1 accuracy on ImageNet with 1% and 10% of labels, respectively, an absolute improvement of +7.8% and +6.2% over previous work. |
| Researcher Affiliation | Industry | DeepMind {matko, richemond, mitrovic}@deepmind.com |
| Pseudocode | Yes | Appendix C (Pseudo-code of SemPPL): Listing 1 provides PyTorch-like pseudo-code for SEMPPL, detailing how we compute pseudo-labels and use them to select the additional semantic positives, which are then used in the contrastive loss alongside the augmentation positives (a minimal sketch of this step appears after the table). |
| Open Source Code | Yes | We release the checkpoints and the evaluation code at https://github.com/deepmind/semppl. |
| Open Datasets | Yes | To evaluate SEMPPL, we pre-train representations using 1% and 10% labelled data from the ImageNet dataset [Russakovsky et al., 2015] based on the splits from Chen et al. [2020a]. |
| Dataset Splits | Yes | We train all the weights (pretrained and classifier weights) using either 1% or 10% of the ImageNet-1k training data, and we use the splits introduced in Chen et al. [2020a] and used in all the methods we compare to: Grill et al. [2020]; Caron et al. [2020]; Dwibedi et al. [2021]; Lee et al. [2021]; Mitrovic et al. [2021]; Tomasev et al. [2022]; Assran et al. [2021]. Models are initially trained on the training sets of the individual datasets, and the validation sets are used to select the best hyperparameters from the executed hyperparameter sweeps. |
| Hardware Specification | Yes | Our final networks were optimised using tranches of between 128 (for a ResNet-50) and 512 (for the largest ResNets) Cloud TPUv3s, for 300 epochs each, irrespective of size. |
| Software Dependencies | No | The paper mentions PyTorch-like pseudo-code and various algorithms/models (e.g., ResNet, SIMCLR, RELICv2) and optimizers (LARS), but it does not specify version numbers for any software libraries, frameworks, or dependencies used. |
| Experiment Setup | Yes | Algorithm parameters: We use a queue of capacity C = 20B, with batch size B = 4096, and temperature τ = 0.2 while randomly sampling negatives from the current batch; we take |N(x)| = 10 negatives in total. For augmentations, we use the standard SIMCLR augmentations [Chen et al., 2020a] and the RELICv2 multi-crop and saliency-based masking [Tomasev et al., 2022]; we use 4 large views and 2 small views for augmentation positives and 3 semantic positives. The semantic positives are computed with a k-NN with k = 1 (see the analysis section in Appendix D); we build a single k-NN instance per augmentation a, queried with all the augmentations, where |a| = 4. This produces |a|² = 16 k-NN-induced pseudo-labels in total for each unlabelled image, among which we then perform majority voting to compute the final pseudo-label (sketched after the table). Optimisation: Our networks are optimised with LARS [You et al., 2017]. Our base learning rate is 0.3 and we train our models for 300 epochs with a learning-rate warm-up period of 10 epochs and a cosine decay schedule thereafter (also sketched after the table). We use a weight decay of 10⁻⁶ and batch size B = 4096. We exclude the biases and batch-normalisation parameters both from LARS adaptation and weight decay. The exponential moving average parameter for target networks is 0.996. |
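
The pseudo-labelling step quoted in the Pseudocode and Experiment Setup rows can be sketched as follows. This is a minimal PyTorch reconstruction based only on the quoted description (one k-NN instance per augmentation, queried with every view, k = 1, majority vote over the |a|² = 16 resulting votes); the function name `predict_pseudo_labels` and the tensor layout are assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_pseudo_labels(query_emb, queue_emb, queue_labels, k=1):
    """k-NN pseudo-labelling with majority voting across views (sketch).

    query_emb    [V, B, D]: embeddings of V augmented views of B unlabelled images.
    queue_emb    [V, C, D]: per-view embeddings of the C labelled queue entries
                            (one k-NN instance per augmentation, per the paper).
    queue_labels [C]:       ground-truth labels of the queue entries.
    Returns      [B]:       pseudo-labels obtained from V * V * k votes per image
                            (16 votes when V = 4 and k = 1, as quoted above).
    """
    V = query_emb.shape[0]
    votes = []
    for qv in range(V):                       # one k-NN instance per queue view
        keys = F.normalize(queue_emb[qv], dim=-1)
        for xv in range(V):                   # queried with every view of the image
            queries = F.normalize(query_emb[xv], dim=-1)
            sims = queries @ keys.T                    # [B, C] cosine similarities
            nn_idx = sims.topk(k, dim=-1).indices      # [B, k] nearest neighbours
            votes.append(queue_labels[nn_idx])         # each neighbour casts a vote
    votes = torch.cat(votes, dim=-1)                   # [B, V * V * k]
    return votes.mode(dim=-1).values                   # majority vote per image
```

Queue entries sharing the winning pseudo-label can then be drawn as the additional semantic positives that enter the contrastive loss alongside the augmentation positives, per the quoted description.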
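The Optimisation entry fully specifies the learning-rate schedule and the target-network update, so both can be sketched directly. A minimal sketch, assuming a per-step schedule; `learning_rate` and `update_target_network` are hypothetical helper names, and LARS itself (with biases and batch-norm parameters excluded from adaptation and weight decay) is not reproduced here.

```python
import math
import torch

def learning_rate(step, steps_per_epoch, base_lr=0.3,
                  warmup_epochs=10, total_epochs=300):
    """Linear warm-up for 10 epochs, then cosine decay, as quoted above."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)              # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))   # cosine decay

@torch.no_grad()
def update_target_network(online_net, target_net, tau=0.996):
    """EMA update of the target network with the quoted coefficient 0.996."""
    for p_online, p_target in zip(online_net.parameters(),
                                  target_net.parameters()):
        p_target.mul_(tau).add_(p_online, alpha=1.0 - tau)
```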