On the Efficacy of Small Self-Supervised Contrastive Models without Distillation Signals
Authors: Haizhou Shi, Youcai Zhang, Siliang Tang, Wenjie Zhu, Yaqian Li, Yandong Guo, Yueting Zhuang (pp. 2225-2234)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first evaluate the representation spaces of the small models and make two non-negligible observations: (i) the small models can complete the pretext task without overfitting despite their limited capacity, and (ii) they universally suffer the problem of over-clustering. Then we verify multiple assumptions that are considered to alleviate the over-clustering phenomenon. Finally, we combine the validated techniques and improve the baseline performances of five small architectures with considerable margins, which indicates that training small self-supervised contrastive models is feasible even without distillation signals. |
| Researcher Affiliation | Collaboration | 1 OPPO Research Institute, 2 Zhejiang University, 3 New York University |
| Pseudocode | No | The paper describes methods and uses mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/WOWNICE/ssl-small. |
| Open Datasets | Yes | All the metrics are evaluated on the penultimate output of the networks (refer to Tab. 2) and the ImageNet-1k dataset (Deng et al. 2009). ... We benchmark the transferability of the backbone networks on CIFAR10, CIFAR100 (Krizhevsky 2009), and Caltech101 (Fei-Fei, Fergus, and Perona 2004) image classification datasets. |
| Dataset Splits | Yes | We sample 50 images per class for both the training and validation sets, making it a 50,000-way classification task for the pre-trained models. ... For the small models, there is no overfitting problem when trained on the pretext task. This conclusion is supported by the fact that each model's metrics have no significant difference on both the training and validation sets. (See the subsampling sketch below the table.) |
| Hardware Specification | Yes | The training times are evaluated on a single 8-card V100 GPU server for 200 epochs of training. |
| Software Dependencies | No | The paper mentions basing research on the "MoCo v2 algorithm" but does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | To better utilize the computational resource, we set the batch size as 1024, and the learning rate as 0.06. ... We set temperature τ = 0.1, batch size B = 512, learning rate η = 0.06, and negative sample size K = 65536. ... We train all the models for 800 epochs with cosine decay, and evaluate them at epoch 200 and epoch 800. (See the InfoNCE sketch below the table.) |
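
The Experiment Setup row quotes the MoCo v2-style hyperparameters reported by the paper (τ = 0.1, B = 512, K = 65536, lr = 0.06 with cosine decay over 800 epochs). Below is a minimal sketch, not the authors' released code, of the InfoNCE objective those values plug into; the tensor names, the SGD choice, and the toy inputs are assumptions made for illustration.

```python
# Hedged sketch of a MoCo v2-style InfoNCE loss using the quoted hyperparameters.
import torch
import torch.nn.functional as F

TAU, BATCH, QUEUE_K, DIM = 0.1, 512, 65536, 128  # values from the Experiment Setup row; DIM is assumed

def info_nce(q, k, queue, tau=TAU):
    """q, k: (B, D) L2-normalized query/key embeddings; queue: (D, K) negative keys."""
    l_pos = torch.einsum("nd,nd->n", q, k).unsqueeze(-1)   # (B, 1) positive logits
    l_neg = torch.einsum("nd,dk->nk", q, queue)            # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau        # (B, 1 + K), temperature-scaled
    labels = torch.zeros(q.size(0), dtype=torch.long)      # the positive sits at index 0
    return F.cross_entropy(logits, labels)

# Optimizer/schedule as quoted (SGD with momentum is an assumption):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.06, momentum=0.9, weight_decay=1e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=800)

if __name__ == "__main__":
    q = F.normalize(torch.randn(BATCH, DIM), dim=1)
    k = F.normalize(torch.randn(BATCH, DIM), dim=1)
    queue = F.normalize(torch.randn(DIM, QUEUE_K), dim=0)
    print(info_nce(q, k, queue).item())  # loss on random inputs, roughly log(1 + K)
```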
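
The Dataset Splits row describes evaluating the pretext task on a balanced subset of ImageNet-1k: 50 images per class drawn from both the training and validation sets, i.e. 1,000 classes x 50 images = 50,000 instances to discriminate. The following is a hedged sketch of such per-class subsampling; the function name and the (path, class_id) input format are assumptions, not taken from the paper's code.

```python
# Hedged sketch: draw at most `per_class` images from every class of an index list.
import random
from collections import defaultdict

def sample_per_class(samples, per_class=50, seed=0):
    """samples: iterable of (image_path, class_id); returns <= per_class items per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append(path)
    subset = []
    for cls, paths in by_class.items():
        rng.shuffle(paths)
        subset.extend((p, cls) for p in paths[:per_class])
    return subset  # ~1,000 classes * 50 images = 50,000 instances for ImageNet-1k

# Example (imagenet_train_index is a hypothetical list of (path, class_id) pairs):
# eval_subset = sample_per_class(imagenet_train_index, per_class=50)
```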