The hidden uniform cluster prior in self-supervised learning
Authors: Mido Assran, Randall Balestriero, Quentin Duval, Florian Bordes, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Nicolas Ballas
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore three joint-embedding methods employing diverse collapse prevention strategies: SimCLR (Chen et al., 2020b), VICReg (Bardes et al., 2021), and MSN (Assran et al., 2022). We also compare the performance of those models to instance-based methods such as MAE (He et al., 2021) and data2vec (Baevski et al., 2022). As can be seen in Table 1, the performance of joint-embedding methods employing volume maximization regularizers degrades significantly on all the semantic downstream tasks (IN1K, CIFAR100, Places205, Clevr/Count) when the mini-batches sampled during pretraining are not class-balanced. |
| Researcher Affiliation | Collaboration | 1 Meta AI (FAIR); 2 McGill University, ECE; 3 Mila, Quebec AI Institute; 4 Université de Montréal, DIRO |
| Pseudocode | No | The paper describes algorithms conceptually and mathematically, including equations for K-means and self-supervised learning losses, but does not include structured pseudocode or algorithm blocks. (The generic K-means objective it builds on is sketched after the table.) |
| Open Source Code | Yes | For pretraining using existing methods, we leverage publicly available implementations along with the default hyperparameters; see Appendix D for details. For evaluation, we use the publicly available VISSL codebase (Goyal et al., 2021); specific evaluation configurations are provided in Appendix D.2. The training details for the PMSN experiments are provided in Appendix 5. MSN. We pretrain a ViT-B/16 with MSN (Assran et al., 2022) using the AdamW optimizer with a batch size of 1024 for 300 epochs and 1024 prototypes using the official code base, which is publicly available: https://github.com/facebookresearch/msn. |
| Open Datasets | Yes | When pretrained on the ImageNet dataset (Russakovsky et al., 2015), these methods have been shown to produce representations that encode highly semantic features (Caron et al., 2020; 2021; Assran et al., 2022). When pretraining on the iNaturalist 2018 dataset (Van Horn et al., 2018), which is naturally long-tailed, we demonstrate that moving away from uniform priors leads to more semantic representations and improved transfer on downstream tasks. |
| Dataset Splits | Yes | We also evaluate in-distribution performance of ImageNet classification (Russakovsky et al., 2015; Chen et al., 2020b). For linear evaluation, we use the default linear evaluation configurations of VISSL (Goyal et al., 2021) to evaluate our models on the following datasets: ImageNet (Russakovsky et al., 2015), iNaturalist18 (Van Horn et al., 2018), CIFAR100 (Krizhevsky et al., 2009), Clevr/Count (Johnson et al., 2017), Clevr/Dist (Johnson et al., 2017), KITTI/Dist (Geiger et al., 2013) and Places205 (Zhou et al., 2014). Table 12: In-distribution: Evaluation of the mini-batch sampling distribution on in-distribution ImageNet linear evaluation using 100% of the training set. |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments are provided. The paper only mentions "GPUs utilized for distributed training" without further specifics. |
| Software Dependencies | No | The paper refers to using specific codebases (e.g., VISSL, official MSN/MAE/data2vec repos) and default hyperparameters, but does not explicitly list required software dependencies with specific version numbers (e.g., "PyTorch 1.9"). |
| Experiment Setup | Yes | SimCLR. We pretrain a ResNet-50 with SimCLR (Chen et al., 2020b), with a batch size of 4096 for 300 epochs. Our pretraining follows the standard hyperparameters defined in Chen et al. (2020b). The learning rate follows the default cosine schedule with a 10 epoch warmup. We use a temperature of 0.1 for the contrastive loss and LARS (You et al., 2017) as an optimizer. We modify the sampler to force K different classes inside each mini-batch, where K is set to 8 for the class-imbalanced sampling experiments and 960 for the class-balanced sampling experiments. (An illustrative sketch of such a class-constrained sampler follows the table.) |
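
The class-constrained sampling described in the Experiment Setup row can be approximated with a simple generator over per-class index groups. The sketch below is an illustrative Python approximation, not the authors' implementation: the function name `class_constrained_batches` and its arguments are hypothetical, and it assumes integer class labels are available for every pretraining sample.

```python
import random
from collections import defaultdict


def class_constrained_batches(labels, batch_size, num_classes_per_batch, seed=0):
    """Yield mini-batches of sample indices drawn from exactly
    `num_classes_per_batch` distinct classes (K = 8 in the class-imbalanced
    runs, K = 960 in the class-balanced runs). Illustrative sketch only."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    class_ids = list(by_class)

    per_class = batch_size // num_classes_per_batch
    while True:
        # Pick K distinct classes, then draw per_class examples (with
        # replacement) from each, so every batch spans exactly K classes.
        chosen = rng.sample(class_ids, num_classes_per_batch)
        batch = []
        for c in chosen:
            batch.extend(rng.choices(by_class[c], k=per_class))
        rng.shuffle(batch)
        yield batch


# Usage (hypothetical labels list): batches of 4096 indices covering only 8 classes.
# sampler = class_constrained_batches(labels, batch_size=4096, num_classes_per_batch=8)
# first_batch = next(sampler)
```

When `batch_size` is not divisible by K (e.g. 4096 and 960), this sketch simply rounds the per-class count down; a production sampler would need to top up the remainder.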
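
For the K-means connection noted in the Pseudocode row, the generic K-means objective over embeddings f(x_i) and centroids c_1, ..., c_K takes the textbook form below; this is not the paper's exact notation, only the standard formulation it builds on.

```latex
\min_{c_1, \dots, c_K} \; \sum_{i=1}^{N} \; \min_{k \in \{1, \dots, K\}} \left\lVert f(x_i) - c_k \right\rVert_2^2
```

The paper's argument, as reflected in its title and in the iNaturalist results quoted above, is that cluster-based self-supervised objectives of this family implicitly assume samples are spread uniformly across the K clusters, and that relaxing this uniform prior improves transfer on long-tailed data.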