A Closer Look at Self-Supervised Lightweight Vision Transformers
Authors: Shaoru Wang, Jin Gao, Zeming Li, Xiaoqin Zhang, Weiming Hu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we develop and benchmark several self-supervised pre-training methods on image classification tasks and some downstream dense prediction tasks. We surprisingly find that if proper pre-training is adopted, even vanilla lightweight ViTs show comparable performance to previous SOTA networks with delicate architecture design. We use ViT-Tiny (Touvron et al., 2021a) in our study to examine the effect of the pre-training on downstream performance, which contains 5.7M parameters. Evaluation Metrics. We adopt fine-tuning as the default evaluation protocol considering that it is highly correlated with utility (Newell & Deng, 2020), in which all the layers are tuned by initializing them with the pre-trained models. By default, we do the evaluation on ImageNet (Deng et al., 2009) by fine-tuning on the training set and evaluating on the validation set. |
| Researcher Affiliation | Collaboration | Shaoru Wang (1,2), Jin Gao (1,2), Zeming Li (3), Xiaoqin Zhang (4), Weiming Hu (1,2,5); 1 State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 Megvii Research; 4 Key Laboratory of Intelligent Informatics for Safety & Emergency of Zhejiang Province, Wenzhou University; 5 School of Information Science and Technology, ShanghaiTech University. |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found. The methods are described in prose and mathematical equations. |
| Open Source Code | Yes | Code is available at https://github.com/wangsr126/mae-lite. |
| Open Datasets | Yes | By default, we do the evaluation on ImageNet (Deng et al., 2009) by fine-tuning on the training set and evaluating on the validation set. Several other downstream classification datasets (e.g., Flowers (Nilsback & Zisserman, 2008), Aircraft (Maji et al., 2013), CIFAR100 (Krizhevsky et al., 2009), etc.) and object detection and segmentation tasks on COCO (Lin et al., 2014) are also exploited for comparison. For all these datasets except iNat18, we fine-tune with SGD (momentum=0.9), and the batch size is set to 512. |
| Dataset Splits | Yes | By default, we do the evaluation on ImageNet (Deng et al., 2009) by fine-tuning on the training set and evaluating on the validation set. The description of each dataset is represented as (train-size/test-size/#classes). We further consider two subsets of IN1K containing 1% and 10% of the total examples (1% IN1K and 10% IN1K) balanced in terms of classes (Assran et al., 2021) and one subset with long-tailed class distribution (Liu et al., 2019) (IN1K-LT). (A sketch of class-balanced subset construction is given after the table.) |
| Hardware Specification | Yes | The pre-training time is measured on a machine with 8 V100 GPUs. The throughput is borrowed from timm (Wightman, 2019), which is measured on a single RTX 3090 GPU with a batch size fixed to 1024 and mixed precision. (A throughput-measurement sketch is given after the table.) |
| Software Dependencies | No | While the paper mentions using specific optimizers (AdamW), learning rate schedules (cosine decay), and data augmentation techniques (RandAug, mixup, cutmix), it does not provide specific version numbers for software libraries or dependencies (e.g., PyTorch version, timm library version). |
| Experiment Setup | Yes | We specifically use ViT-Tiny (Touvron et al., 2021a) in our study to examine the effect of the pre-training on downstream performance, which contains 5.7M parameters. The number of heads is increased to 12. We find MAE prefers a much more lightweight decoder when the encoder is small, thus a decoder with only one Transformer block is adopted by default and the width is 192. We sweep over 5 masking ratios {0.45, 0.55, 0.65, 0.75, 0.85} and find 0.75 achieves the best performance. Fine-tuning evaluation settings: optimizer AdamW, base learning rate 1e-3, weight decay 0.05, optimizer momentum β1, β2 = 0.9, 0.999, layer-wise lr decay (Bao et al., 2021) 0.85 (MAE), 0.75 (MoCo-v3), batch size 1024, learning rate schedule cosine decay, warmup epochs 5, training epochs {100, 300, 1000}, augmentation RandAug(10, 0.5), color jitter 0.3, label smoothing 0, mixup 0.2, cutmix 0, drop path 0. Pre-training setting for MoCo-v3: optimizer AdamW, base learning rate 1.5e-4, weight decay 0.1, optimizer momentum β1, β2 = 0.9, 0.999, batch size 1024, learning rate schedule cosine decay, warmup epochs 40, training epochs 400, momentum coefficient 0.99, temperature 0.2. We resize images to 224×224. For transfer evaluation on dense prediction tasks, we decrease the input image size from 1024 to 640 and fine-tune for up to 100 epochs, with weight decay of 0.05. (Minimal sketches of the MAE-style masking and the layer-wise lr decay used in fine-tuning follow the table.) |
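
The Experiment Setup row reports that a masking ratio of 0.75 works best together with a single-block, width-192 decoder. The sketch below is an illustrative re-implementation of MAE-style per-sample random masking in PyTorch, not the authors' mae-lite code; the tensor shapes and function name are assumptions.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (B, N, D) patch embeddings; keep a random (1 - mask_ratio) subset."""
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)     # one uniform score per patch
    ids_shuffle = torch.argsort(noise, dim=1)          # random permutation of patches
    ids_restore = torch.argsort(ids_shuffle, dim=1)    # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)      # 1 = masked, 0 = kept
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)          # reorder mask to original patch order
    return visible, mask, ids_restore
```

Only the `visible` tokens are fed to the encoder; the decoder later reinserts mask tokens at the positions given by `ids_restore` before reconstructing the masked patches.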
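
The fine-tuning settings list a layer-wise lr decay of 0.85 (MAE) / 0.75 (MoCo-v3) with AdamW. Below is a minimal sketch of how such per-layer scaling can be wired into AdamW parameter groups; the module names (`patch_embed`, `cls_token`, `pos_embed`, `blocks`) follow a timm-style ViT and are assumptions, not the authors' exact implementation.

```python
import torch

def layer_decay_param_groups(model, base_lr=1e-3, weight_decay=0.05,
                             layer_decay=0.85, num_blocks=12):
    """Scale each parameter's lr by layer_decay**(distance from the top layer)."""
    num_layers = num_blocks + 1          # patch embedding counts as layer 0
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith(("patch_embed", "cls_token", "pos_embed")):
            layer_id = 0
        elif name.startswith("blocks."):
            layer_id = int(name.split(".")[1]) + 1
        else:                            # final norm / classification head
            layer_id = num_layers
        scale = layer_decay ** (num_layers - layer_id)
        groups.append({"params": [param],
                       "lr": base_lr * scale,
                       "weight_decay": weight_decay})
    return groups

# optimizer = torch.optim.AdamW(layer_decay_param_groups(vit_tiny), betas=(0.9, 0.999))
```

Layers closer to the input receive smaller learning rates, which is the intended effect of the decay factor reported in the table.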
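
The Dataset Splits row mentions 1% and 10% IN1K subsets balanced in terms of classes (Assran et al., 2021). A minimal sketch of building such a class-balanced subset is given below; the `(image_path, class_id)` sample format and the function name are assumptions, and the authors may rely on published file lists instead.

```python
import random
from collections import defaultdict

def balanced_subset(samples, fraction=0.01, seed=0):
    """samples: iterable of (image_path, class_id); sample an equal fraction per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append(path)
    subset = []
    for cls, paths in by_class.items():
        k = max(1, round(len(paths) * fraction))   # keep at least one image per class
        subset.extend((p, cls) for p in rng.sample(paths, k))
    return subset
```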
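
The Hardware Specification row states that throughput is measured on a single RTX 3090 with a batch size of 1024 and mixed precision, following timm. The sketch below shows one plausible way to reproduce such a measurement in PyTorch; it is a stand-in under these assumptions, not timm's exact benchmark procedure.

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=1024, img_size=224, iters=30, device="cuda"):
    """Return images/second for a single GPU under mixed precision."""
    model.eval().to(device)
    x = torch.randn(batch_size, 3, img_size, img_size, device=device)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        for _ in range(5):               # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)
```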