VICRegL: Self-Supervised Learning of Local Visual Features
Authors: Adrien Bardes, Jean Ponce, Yann LeCun
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate strong performance on linear classification and segmentation transfer tasks. Our evaluation is (mostly) done in the setting where the backbone learned by VICRegL is frozen, with only a linear classification or segmentation head tuned to the task at hand (a minimal sketch of this frozen-backbone protocol follows the table). Our results show that learning local features, in addition to global features, does not hurt the classification performance, but significantly improves segmentation accuracy. On the Pascal VOC linear frozen semantic segmentation task, VICRegL achieves 55.9 mIoU with a ResNet-50 backbone, which is a +8.1 mIoU improvement over VICReg, and 67.5 mIoU with a ConvNeXt-S backbone, which is a +6.6 mIoU improvement. |
| Researcher Affiliation | Collaboration | Adrien Bardes (1,2), Jean Ponce (2,4), Yann LeCun (1,3,4). 1: Meta, FAIR; 2: Inria, École normale supérieure, CNRS, PSL Research University; 3: Courant Institute, New York University; 4: Center for Data Science, New York University |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. It provides equations and a diagram (Figure 1) to describe the method. |
| Open Source Code | Yes | Code and pretrained models are publicly available at: https://github.com/facebookresearch/VICRegL |
| Open Datasets | Yes | All the models are pretrained on the 1000-class unlabelled ImageNet dataset. We evaluate the representations obtained after pretraining VICRegL with a ResNet-50, and ConvNeXt backbones [Liu et al., 2022] of various sizes, on linear classification on ImageNet-1k [Deng et al., 2009], and linear semantic segmentation on Pascal VOC [Everingham et al., 2010], Cityscapes [Cordts et al., 2016] and ADE20k [Zhou et al., 2019]. |
| Dataset Splits | No | The paper mentions evaluating on the 'validation set of ImageNet' in Table 1 and averaging results over '3 runs with randomly initialized parameters'. However, it does not explicitly provide the specific percentages or sample counts for the training, validation, and test splits used across all datasets, nor does it cite a specific source for these splits. |
| Hardware Specification | Yes | With the ResNet-50 backbone, we train our models on 32 Nvidia Tesla V100-32Gb GPUs... we therefore train our ConvNeXt-S models on 8 Nvidia Tesla V100-32Gb GPUs... ConvNeXt-B models on 16 Nvidia Tesla V100-32Gb GPUs |
| Software Dependencies | No | The paper mentions optimizers like LARS and AdamW, but it does not specify version numbers for any key software components (e.g., Python, PyTorch, CUDA, or other libraries). |
| Experiment Setup | Yes | Most hyper-parameters are kept unchanged compared to the implementation provided by [Bardes et al., 2022]; the VICReg loss variance, invariance and covariance coefficients are set to 25, 25 and 1. With the ResNet-50 backbone, we train our models on 32 Nvidia Tesla V100-32Gb GPUs, with the LARS optimizer [You et al., 2017, Goyal et al., 2017], a weight decay of 10⁻⁶, a batch size of 2048 and a learning rate of 0.1. The learning rate follows a cosine decay schedule [Loshchilov and Hutter, 2017], starting from 0 with 10 warmup epochs and with a final value of 0.002. The number of selected best matches γ of Eq. (2) and (3) is set to 20. With ConvNeXt backbones, we noticed that much smaller batch sizes actually improve the performance; we therefore train our ConvNeXt-S models on 8 Nvidia Tesla V100-32Gb GPUs, with the AdamW optimizer [Loshchilov and Hutter, 2019], a weight decay of 10⁻⁶, a batch size of 384 and a learning rate of 0.001, and our ConvNeXt-B models on 16 Nvidia Tesla V100-32Gb GPUs with a batch size of 572 and the same other hyper-parameters. The learning rate follows a cosine decay schedule, starting from 0 with 10 warmup epochs and with a final value of 0.00001. The number of selected best matches γ1 and γ2 of Eq. (5) are set to 20 for feature maps from large views and 4 for feature maps from small views. (A hedged configuration sketch of the ResNet-50 schedule also follows the table.) |
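
The Research Type row above describes evaluation with the pretrained backbone frozen and only a linear head trained. Below is a minimal PyTorch sketch of that frozen-backbone protocol, assuming a backbone that returns pooled `(batch, feat_dim)` features; the function and argument names are illustrative, not taken from the authors' code.

```python
import torch.nn as nn

def build_linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int) -> nn.Module:
    """Frozen-backbone linear evaluation: only the linear head receives gradients."""
    for p in backbone.parameters():
        p.requires_grad = False  # keep the pretrained backbone fixed
    backbone.eval()              # also keep normalization statistics fixed
    head = nn.Linear(feat_dim, num_classes)
    return nn.Sequential(backbone, head)
```

For the linear segmentation transfer described in the same row, the head would instead be applied per-location to the backbone's feature maps; the sketch above covers only the classification case.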
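
The Experiment Setup row lists the ResNet-50 pretraining hyper-parameters. The sketch below collects them into a single configuration and reproduces the warmup-plus-cosine learning-rate shape it describes; the total number of epochs is not stated in the quoted text and is left as a required argument, and all names here are illustrative rather than taken from the released code.

```python
import math

# Values quoted in the Experiment Setup row (ResNet-50 / LARS setting).
RESNET50_CONFIG = {
    "batch_size": 2048,
    "base_lr": 0.1,
    "final_lr": 0.002,
    "warmup_epochs": 10,
    "weight_decay": 1e-6,
    "loss_coeffs": {"variance": 25, "invariance": 25, "covariance": 1},
    "num_best_matches": 20,  # gamma in Eq. (2) and (3)
}

def cosine_lr_with_warmup(epoch: int, total_epochs: int, cfg: dict = RESNET50_CONFIG) -> float:
    """Linear warmup from 0 to base_lr, then cosine decay down to final_lr."""
    if epoch < cfg["warmup_epochs"]:
        return cfg["base_lr"] * epoch / cfg["warmup_epochs"]
    progress = (epoch - cfg["warmup_epochs"]) / (total_epochs - cfg["warmup_epochs"])
    return cfg["final_lr"] + 0.5 * (cfg["base_lr"] - cfg["final_lr"]) * (1 + math.cos(math.pi * progress))
```

The ConvNeXt runs described in the same row follow the same schedule shape but swap LARS for AdamW, use batch sizes of 384 (ConvNeXt-S) and 572 (ConvNeXt-B), a learning rate of 0.001, and a final value of 0.00001.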