On the Efficacy of Small Self-Supervised Contrastive Models without Distillation Signals

Authors: Haizhou Shi, Youcai Zhang, Siliang Tang, Wenjie Zhu, Yaqian Li, Yandong Guo, Yueting Zhuang

AAAI 2022, pp. 2225-2234

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first evaluate the representation spaces of the small models and make two non-negligible observations: (i) the small models can complete the pretext task without overfitting despite their limited capacity and (ii) they universally suffer the problem of over-clustering. Then we verify multiple assumptions that are considered to alleviate the over-clustering phenomenon. Finally, we combine the validated techniques and improve the baseline performances of five small architectures with considerable margins, which indicates that training small self-supervised contrastive models is feasible even without distillation signals. (A hedged sketch of one way to probe over-clustering follows the table.)
Researcher Affiliation | Collaboration | 1 OPPO Research Institute, 2 Zhejiang University, 3 New York University
Pseudocode | No | The paper describes methods and uses mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/WOWNICE/ssl-small.
Open Datasets | Yes | All the metrics are evaluated on the penultimate output of the networks (refer to Tab. 2) and the ImageNet-1k dataset (Deng et al. 2009). ... We benchmark the transferability of the backbone networks on CIFAR10, CIFAR100 (Krizhevsky 2009), and Caltech101 (Fei-Fei, Fergus, and Perona 2004) image classification datasets. (A hedged data-loading sketch for these transfer datasets follows the table.)
Dataset Splits | Yes | We sample 50 images per class for both the training set and the validation sets, making it a 50,000-way classification task for the pre-trained models. ... For the small models, there is no overfitting problem when trained on the pretext task. This conclusion is supported by the fact that each model's metrics have no significant difference on both the training and validation sets. (A hedged per-class subsampling sketch follows the table.)
Hardware Specification | Yes | The training times are evaluated on a single 8-card V100 GPU server for 200 epochs of training.
Software Dependencies | No | The paper mentions basing its research on the MoCo v2 algorithm, but does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | To better utilize the computational resource, we set the batch size as 1024, and the learning rate as 0.06. ... We set temperature τ = 0.1, batch size B = 512, learning rate η = 0.06, and negative sample size K = 65536. ... We train all the models for 800 epochs with cosine decay, and evaluate them at epoch 200 and epoch 800. (A hedged sketch of the corresponding InfoNCE loss follows the table.)
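
The Research Type row quotes the paper's over-clustering observation. The sketch below shows one way such fragmentation of a representation space could be probed, assuming scikit-learn, an (N, D) array of penultimate-layer embeddings, and ground-truth labels; this is not the paper's own metric, and the function and variable names are illustrative.

```python
# Hypothetical probe of "over-clustering": cluster the learned embeddings with
# k-means at the true class count and check how fragmented each class is.
# This is an illustrative sketch, not the paper's evaluation metric.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def over_clustering_probe(embeddings: np.ndarray, labels: np.ndarray, n_classes: int):
    """embeddings: (N, D) penultimate-layer features; labels: (N,) ground-truth classes."""
    # L2-normalize features, as is common for contrastive representations.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cluster_ids = KMeans(n_clusters=n_classes, n_init=10).fit_predict(embeddings)
    nmi = normalized_mutual_info_score(labels, cluster_ids)
    # Average number of distinct k-means clusters each class is scattered across:
    # higher values indicate stronger fragmentation of the class.
    frag = np.mean([len(np.unique(cluster_ids[labels == c])) for c in np.unique(labels)])
    return {"nmi": nmi, "avg_clusters_per_class": frag}
```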
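
For the transfer benchmarks quoted in the Open Datasets row, a minimal data-loading sketch using torchvision's built-in CIFAR10, CIFAR100, and Caltech101 classes is given below; the resize, crop, and normalization choices are assumptions for illustration, not the paper's documented evaluation pipeline.

```python
# Minimal sketch of loading the three transfer datasets named in the report.
import torchvision
import torchvision.transforms as T

transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.Lambda(lambda im: im.convert("RGB")),  # Caltech101 contains grayscale images
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

transfer_sets = {
    "cifar10":    torchvision.datasets.CIFAR10("data", train=True, download=True, transform=transform),
    "cifar100":   torchvision.datasets.CIFAR100("data", train=True, download=True, transform=transform),
    "caltech101": torchvision.datasets.Caltech101("data", download=True, transform=transform),
}
```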
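
The Dataset Splits row quotes a balanced 50-images-per-class sample. The sketch below shows one way such a subset could be drawn, assuming a torchvision ImageFolder-style dataset that exposes a `.targets` list; the helper name `per_class_subset` is hypothetical.

```python
# Hedged sketch of drawing a balanced 50-images-per-class subset by index.
import random
from collections import defaultdict
from torch.utils.data import Subset

def per_class_subset(dataset, per_class: int = 50, seed: int = 0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, target in enumerate(dataset.targets):  # ImageFolder exposes .targets
        by_class[target].append(idx)
    chosen = []
    for indices in by_class.values():
        rng.shuffle(indices)
        chosen.extend(indices[:per_class])
    return Subset(dataset, chosen)
```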
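
The Experiment Setup row quotes MoCo-style hyperparameters (τ = 0.1, K = 65536). The following is a hedged sketch of the corresponding InfoNCE loss over a queue of negatives, not the authors' released implementation; see the repository linked above for the latter.

```python
# Hedged sketch of a MoCo-v2-style InfoNCE loss with the quoted hyperparameters.
# Tensor names (q, k, queue) are illustrative.
import torch
import torch.nn.functional as F

def moco_infonce(q, k, queue, tau: float = 0.1):
    """q, k: (B, D) L2-normalized query/key features; queue: (D, K) bank of negatives."""
    # Positive logits: similarity between each query and its own key.
    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)   # (B, 1)
    # Negative logits: similarity against the K queued negatives.
    l_neg = torch.einsum("nc,ck->nk", q, queue)            # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # The positive is always at index 0 of each row of logits.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```

Under the quoted schedule (learning rate 0.06, cosine decay, 800 epochs), this loss would typically be optimized with SGD, though optimizer details beyond the quoted values are not specified in this report.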