The hidden uniform cluster prior in self-supervised learning

Authors: Mido Assran, Randall Balestriero, Quentin Duval, Florian Bordes, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Nicolas Ballas

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We explore three joint-embedding methods employing diverse collapse prevention strategies: SimCLR (Chen et al., 2020b), VICReg (Bardes et al., 2021), and MSN (Assran et al., 2022). We also compare the performance of those models to instance-based methods such as MAE (He et al., 2021) and data2vec (Baevski et al., 2022). As can be seen in Table 1, the performance of joint-embedding methods employing volume maximization regularizers degrades significantly on all the semantic downstream tasks (IN1K, CIFAR100, Places205, Clevr/Count) when the mini-batches sampled during pretraining are not class-balanced.
Researcher Affiliation | Collaboration | 1 Meta AI (FAIR); 2 McGill University, ECE; 3 Mila, Quebec AI Institute; 4 Université de Montréal, DIRO
Pseudocode | No | The paper describes algorithms conceptually and mathematically, including equations for K-means and self-supervised learning losses, but does not include structured pseudocode or algorithm blocks. (For reference, a generic K-means sketch is included after this table.)
Open Source Code | Yes | For pretraining using existing methods, we leverage publicly available implementations along with the default hyperparameters; see Appendix D for details. For evaluation, we use the publicly available VISSL codebase (Goyal et al., 2021); specific evaluation configurations are provided in Appendix D.2. The training details for the PMSN experiments are provided in Appendix 5. MSN: We pretrain a ViT-B/16 with MSN (Assran et al., 2022) using the AdamW optimizer with a batch size of 1024 for 300 epochs and 1024 prototypes, using the official codebase, which is publicly available: https://github.com/facebookresearch/msn.
Open Datasets | Yes | When pretrained on the ImageNet dataset (Russakovsky et al., 2015), these methods have been shown to produce representations that encode highly semantic features (Caron et al., 2020; 2021; Assran et al., 2022). When pretraining on the iNaturalist 2018 dataset (Van Horn et al., 2018), which is naturally long-tailed, we demonstrate that moving away from uniform priors leads to more semantic representations and improved transfer on downstream tasks.
Dataset Splits | Yes | We also evaluate in-distribution performance of ImageNet classification (Russakovsky et al., 2015; Chen et al., 2020b). For linear evaluation, we use the default linear evaluation configurations of VISSL (Goyal et al., 2021) to evaluate our models on the following datasets: ImageNet (Russakovsky et al., 2015), iNaturalist18 (Van Horn et al., 2018), CIFAR100 (Krizhevsky et al., 2009), Clevr/Count (Johnson et al., 2017), Clevr/Dist (Johnson et al., 2017), KITTI/Dist (Geiger et al., 2013) and Places205 (Zhou et al., 2014). Table 12: In-distribution: Evaluation of the mini-batch sampling distribution on in-distribution ImageNet linear evaluation using 100% of the training set. (A generic linear-probe sketch is given after this table.)
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed machine specifications) used for running experiments are provided. The paper only mentions "GPUs utilized for distributed training" without further specifics.
Software Dependencies | No | The paper refers to specific codebases (e.g., VISSL and the official MSN/MAE/data2vec repositories) and default hyperparameters, but does not explicitly list required software dependencies with version numbers (e.g., "PyTorch 1.9").
Experiment Setup | Yes | SimCLR: We pretrain a ResNet-50 with SimCLR (Chen et al., 2020b), with a batch size of 4096 for 300 epochs. Our pretraining follows the standard hyperparameters defined in Chen et al. (2020b). The learning rate follows the default cosine schedule with a 10 epoch warmup. We use a temperature of 0.1 for the contrastive loss and LARS (You et al., 2017) as the optimizer. We modify the sampler to force K different classes inside each mini-batch, where K is set to 8 for the class-imbalanced sampling experiments and 960 for the class-balanced sampling experiments. (An illustrative sketch of such a K-class sampler is given after this table.)
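
Regarding the Pseudocode row: the K-means objective the paper discusses is the standard one, so a minimal NumPy implementation of Lloyd's algorithm is sketched below for reference. This is a generic illustration, not code from the paper or its official repositories.

import numpy as np

def kmeans(x, k, n_iters=100, seed=0):
    # x: (n, d) array of features; returns (centroids, hard assignments).
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: nearest centroid under squared Euclidean distance.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assign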
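
Regarding the Dataset Splits row: linear evaluation trains a linear classifier on frozen backbone features. The PyTorch sketch below is illustrative only; the actual evaluations use VISSL's default configurations, and encoder, train_loader, and the hyperparameters here are placeholders.

import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, feat_dim, num_classes,
                 epochs=28, lr=0.01, device="cuda"):
    encoder.eval()                              # keep the pretrained backbone frozen
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():               # no gradients through the encoder
                feats = encoder(images)
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head                                 # the trained linear classifier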
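
Regarding the Experiment Setup row: the described sampler modification, forcing K distinct classes into each mini-batch, can be implemented as a custom batch sampler. The PyTorch sketch below is one possible implementation, assuming per-sample labels are available to the sampler; it is not the authors' code.

import random
from collections import defaultdict
from torch.utils.data import Sampler

class KClassBatchSampler(Sampler):
    # Yields lists of dataset indices; every mini-batch draws from exactly k classes.
    def __init__(self, labels, batch_size, k, num_batches):
        self.batch_size, self.k, self.num_batches = batch_size, k, num_batches
        self.by_class = defaultdict(list)
        for idx, y in enumerate(labels):
            self.by_class[y].append(idx)
        self.classes = list(self.by_class)

    def __iter__(self):
        per_class = self.batch_size // self.k
        for _ in range(self.num_batches):
            batch = []
            for c in random.sample(self.classes, self.k):   # k distinct classes
                batch += random.choices(self.by_class[c], k=per_class)  # with replacement
            yield batch

    def __len__(self):
        return self.num_batches

An instance would be passed as the batch_sampler argument of a DataLoader, e.g. with batch_size=4096 and k=8 for the class-imbalanced setting or k=960 for the class-balanced setting.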