Effective Self-supervised Pre-training on Low-compute Networks without Distillation
Authors: Fuwen Tan, Fatemeh Sadat Saleh, Brais Martinez
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pre-train these methods on ImageNet1K (Russakovsky et al., 2015) using MobileNetV2 (Sandler et al., 2018) as the backbone. MoCo-v2 does not use multiple crops by default. Following Gansbeke et al. (2021), we re-implement a variant of MoCo-v2 with multiple crops. All models are trained for 200 epochs with a batch size of 1024. In Table 1, we compare the performance of the SSL methods against the supervised pre-training, as well as the distillation-based model SimReg (Navaneet et al., 2021), which is the current state-of-the-art method. |
| Researcher Affiliation | Industry | Fuwen Tan (Samsung AI Cambridge); Fatemeh Saleh (Microsoft Research Cambridge); Brais Martinez (Samsung AI Cambridge) |
| Pseudocode | No | The paper describes methods in prose and tables but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is publicly available at github.com/saic-fi/SSLight. |
| Open Datasets | Yes | We pre-train these methods on ImageNet1K (Russakovsky et al., 2015)... |
| Dataset Splits | Yes | For the evaluation, we reserve a random subset of 50,000 images from the training set of ImageNet1K, i.e. 50 per category, and report the linear accuracy on this subset to avoid over-fitting, only using the ImageNet1K validation set in the final experiments in Sec. 4. |
| Hardware Specification | No | The paper mentions training on 'an 8-GPU server' and 'hardware limitations' leading to smaller batch sizes for some models, but does not provide specific details such as GPU model numbers, CPU types, or memory. |
| Software Dependencies | No | The paper mentions optimizers like LARS, AdamW, and SGD, and frameworks such as Detectron2 and Torchvision, but it does not specify their version numbers. |
| Experiment Setup | Yes | All models are trained for 200 epochs with a batch size of 1024. Following Caron et al. (2021), we use the LARS optimizer (You et al., 2017) for the convolution-based networks, and AdamW (Loshchilov & Hutter, 2019) for the vision transformers. We use a batch size of 1024 and a linear learning rate warm-up in the first 10 epochs. After the warm-up, the learning rate decays with a cosine schedule (Loshchilov & Hutter, 2017). Most of the other hyper-parameters are inherited from the original literature. We provide further details in Appendix Sec. A. |
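
The "Dataset Splits" row quotes the paper's held-out evaluation subset: 50 random training images per category, 50,000 in total. The snippet below is a minimal sketch of how such a split could be constructed; it is not the authors' released code, and the dataset path and random seed are illustrative assumptions.

```python
# Hedged sketch: reserve 50 training images per class from ImageNet1K for the
# linear-probe evaluation subset described in the "Dataset Splits" row.
# The path and seed are assumptions, not values taken from the paper.
import random
from collections import defaultdict

from torchvision.datasets import ImageFolder

train_set = ImageFolder("/data/imagenet/train")  # assumed local ImageNet1K train path

# Group sample indices by class label.
indices_by_class = defaultdict(list)
for idx, (_, label) in enumerate(train_set.samples):
    indices_by_class[label].append(idx)

# Reserve 50 random images per class (50,000 images for 1,000 classes).
rng = random.Random(0)  # assumed seed; the paper does not specify one
heldout = [i for idxs in indices_by_class.values() for i in rng.sample(idxs, 50)]
pretrain = sorted(set(range(len(train_set))) - set(heldout))
```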
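The "Experiment Setup" row quotes a linear warm-up over the first 10 epochs followed by cosine decay. The function below sketches that schedule under stated assumptions: the base learning rate, the decay target of zero, and the epoch-level granularity are illustrative, since the paper's exact per-method values are deferred to its Appendix A.

```python
# Hedged sketch of the learning-rate schedule quoted in "Experiment Setup":
# linear warm-up for the first 10 epochs, then cosine decay over the remaining
# epochs of a 200-epoch run. base_lr and the decay floor of 0 are assumptions.
import math

def lr_at_epoch(epoch: float, base_lr: float = 0.5,
                warmup_epochs: int = 10, total_epochs: int = 200) -> float:
    """Return the learning rate at a (possibly fractional) epoch index."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs  # linear warm-up from 0
    # Cosine decay from base_lr down to 0 after the warm-up phase.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In practice this scalar would be applied per iteration to a LARS optimizer for the convolutional backbones or to AdamW for the vision transformers, as the quoted setup describes.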