Effective Self-supervised Pre-training on Low-compute Networks without Distillation

Authors: Fuwen Tan, Fatemeh Sadat Saleh, Brais Martinez

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We pre-train these methods on ImageNet1K (Russakovsky et al., 2015) using MobileNetV2 (Sandler et al., 2018) as the backbone. MoCo-v2 does not use multiple crops by default. Following (Gansbeke et al., 2021), we re-implement a variant of MoCo-v2 with multiple crops, noted as MoCo-v2. All models are trained for 200 epochs with a batch size of 1024. In Table 1, we compare the performance of the SSL methods against the supervised pre-training, as well as the distillation-based model SimReg (Navaneet et al., 2021), which is the current state-of-the-art method.
Researcher Affiliation | Industry | Fuwen Tan (Samsung AI Cambridge), Fatemeh Saleh (Microsoft Research Cambridge), Brais Martinez (Samsung AI Cambridge)
Pseudocode | No | The paper describes methods in prose and tables but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is publicly available at github.com/saic-fi/SSLight.
Open Datasets | Yes | We pre-train these methods on ImageNet1K (Russakovsky et al., 2015)...
Dataset Splits | Yes | For the evaluation, we reserve a random subset of 50,000 images from the training set of ImageNet1K, i.e. 50 per category, and report the linear accuracy on this subset to avoid over-fitting, only using the ImageNet1K validation set in the final experiments in Sec. 4.
Hardware Specification | No | The paper mentions training on 'an 8-GPU server' and 'hardware limitations' leading to smaller batch sizes for some models, but does not provide specific details such as GPU model numbers, CPU types, or memory.
Software Dependencies | No | The paper mentions optimizers like LARS, AdamW, and SGD, and frameworks such as Detectron2 and Torchvision, but it does not specify their version numbers.
Experiment Setup | Yes | All models are trained for 200 epochs with a batch size of 1024. Following Caron et al. (2021), we use the LARS optimizer (You et al., 2017) for the convolution-based networks, and AdamW (Loshchilov & Hutter, 2019) for the vision transformers. We use a batch size of 1024 and a linear learning rate warm-up in the first 10 epochs. After the warm-up, the learning rate decays with a cosine schedule (Loshchilov & Hutter, 2017). Most of the other hyper-parameters are inherited from the original literature. We provide further details in Appendix Sec. A.
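
The Research Type row above quotes a multi-crop variant of MoCo-v2. As a rough illustration only (not the paper's implementation; crop sizes, crop counts, and jitter strengths here are assumptions), a torchvision-style multi-crop augmentation could look like this:

```python
# Illustrative multi-crop augmentation sketch (settings are assumptions,
# not the exact configuration used in the paper).
import torchvision.transforms as T

def build_multicrop_transform(num_local_crops=6):
    flip_and_jitter = T.Compose([
        T.RandomHorizontalFlip(p=0.5),
        T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.ToTensor(),
    ])
    # Two large "global" crops plus several small "local" crops per image.
    global_crop = T.Compose([T.RandomResizedCrop(224, scale=(0.4, 1.0)), flip_and_jitter])
    local_crop = T.Compose([T.RandomResizedCrop(96, scale=(0.05, 0.4)), flip_and_jitter])

    def transform(image):
        crops = [global_crop(image), global_crop(image)]
        crops += [local_crop(image) for _ in range(num_local_crops)]
        return crops

    return transform
```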
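
The Dataset Splits row quotes holding out 50 images per category from the ImageNet1K training set (50,000 images in total). A minimal sketch of such a split, assuming samples come as (path, class_id) pairs such as torchvision's ImageFolder.samples, is shown below; the exact sampling procedure is an assumption:

```python
# Sketch of reserving a fixed number of images per class for linear evaluation.
import random
from collections import defaultdict

def split_per_class(samples, per_class=50, seed=0):
    """samples: list of (path, class_id) pairs, e.g. torchvision ImageFolder.samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, (_, cls) in enumerate(samples):
        by_class[cls].append(idx)
    heldout, train = [], []
    for cls, indices in by_class.items():
        rng.shuffle(indices)
        heldout += indices[:per_class]   # 50 per category -> 50,000 total for 1,000 classes
        train += indices[per_class:]
    return train, heldout
```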
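
The Experiment Setup row quotes a linear learning-rate warm-up over the first 10 epochs followed by cosine decay. A minimal sketch of that schedule, with a placeholder base learning rate and the quoted 200-epoch budget, is:

```python
# Sketch of linear warm-up followed by cosine decay (base_lr is a placeholder;
# the paper's per-method learning rates are given in its appendix).
import math

def learning_rate(base_lr, epoch, warmup_epochs=10, total_epochs=200):
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs              # linear warm-up
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))   # cosine decay
```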