ScaleKD: Strong Vision Transformers Could Be Excellent Teachers
Authors: Jiawei Fan, Chao Li, Xiaolong Liu, Anbang Yao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By combining three closely coupled components, namely cross attention projector, dual-view feature mimicking, and teacher parameter perception, tailored to address the alignment problems stated above, we present a simple and effective knowledge distillation method, called ScaleKD. Our method can train student backbones that span across a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets, achieving state-of-the-art knowledge distillation performance. For instance, taking a well pre-trained Swin-L as the teacher model, our method gets 75.15%\|82.03%\|84.16%\|78.63%\|81.96%\|83.93%\|83.80%\|85.53% top-1 accuracies for MobileNet-V1\|ResNet-50\|ConvNeXt-T\|Mixer-S/16\|Mixer-B/16\|ViT-S/16\|Swin-T\|ViT-B/16 models trained on ImageNet-1K dataset from scratch, showing 3.05%\|3.39%\|2.02%\|4.61%\|5.52%\|4.03%\|2.62%\|3.73% absolute gains to the individually trained counterparts. Intriguingly, when scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties, bringing increasingly larger gains to student models. We also empirically show that the student backbones trained by our method transfer well on downstream MS-COCO and ADE20K datasets. (See the projector and feature-mimicking sketches after this table.) |
| Researcher Affiliation | Industry | Jiawei Fan, Intel Labs China, jiawei.fan@intel.com; Chao Li, Intel Labs China, chao3.li@intel.com; Xiaolong Liu, iMotion Automotive Technology, xiaolong.liu@imotion.ai; Anbang Yao, Intel Labs China, anbang.yao@intel.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/deep-optimization/ScaleKD. |
| Open Datasets | Yes | Unless otherwise stated, in experiments, the student backbones are trained on IN-1K from scratch, without the pre-training on other upstream datasets. Experimental details are in Appendix A and B. ImageNet-1K [45] is a well-known large-scale classification dataset, comprising over 1.2 million training images and 50,000 validation images with 1,000 object categories. |
| Dataset Splits | Yes | ImageNet-1K [45] is a well-known large-scale classification dataset, comprising over 1.2 million training images and 50,000 validation images with 1,000 object categories. |
| Hardware Specification | Yes | The experiments using the traditional training strategy are conducted on 8 NVIDIA Tesla-V100 GPUs, while the experiments using the advanced training strategy are conducted on 32 NVIDIA Tesla-V100 GPUs. All experiments on MS-COCO and ADE20K are conducted on 8 NVIDIA Tesla-V100 GPUs. |
| Software Dependencies | No | The paper states, 'We implement our method based on MMClassification [67]' and 'We conduct experiments based on MMDetection [71] and MMSegmentation [72].' While these frameworks are cited with publication years, specific version numbers (e.g., 'MMClassification vX.Y') are not explicitly provided. |
| Experiment Setup | Yes | We conduct our experiments with two popular training strategies: traditional training strategy and advanced training strategy. The traditional training strategy is commonly used in previous KD approaches (shown in Table 10a) and the advanced training strategy is adopted in training recently proposed CNNs, MLPs, and ViTs (shown in Table 10b). Table 10a and 10b provide detailed settings including Batch Size, Learning Rate, Learning Rate Schedule, Optimizer, Optimizer Hyper-Parameters, Weight Decay, Training Epochs, Warmup Epochs, Drop Path, Label Smoothing, Random Flip, Random Resize Crop, Random Augmentation, and Random Erasing. (A placeholder configuration skeleton follows this table.) |
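
The Research Type row quotes the abstract, which names a cross attention projector that aligns student features with the teacher's feature space. The snippet below is a minimal, non-authoritative sketch of such a projector in PyTorch; the class name, tensor shapes, learnable queries, and the use of `nn.MultiheadAttention` are assumptions made for illustration and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionProjector(nn.Module):
    """Sketch: map student tokens into the teacher's feature space via
    cross attention. All shapes and hyper-parameters are illustrative."""

    def __init__(self, student_dim=768, teacher_dim=1536, num_heads=8, num_queries=196):
        super().__init__()
        self.in_proj = nn.Linear(student_dim, teacher_dim)                 # match channel width
        self.query = nn.Parameter(torch.zeros(num_queries, teacher_dim))   # learnable queries (assumed)
        self.attn = nn.MultiheadAttention(teacher_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(teacher_dim)

    def forward(self, student_tokens):
        # student_tokens: (B, N, student_dim), e.g. flattened CNN/MLP/ViT features
        kv = self.in_proj(student_tokens)                                  # (B, N, teacher_dim)
        q = self.query.unsqueeze(0).expand(kv.size(0), -1, -1)             # (B, Q, teacher_dim)
        aligned, _ = self.attn(q, kv, kv)                                  # cross attention
        return self.norm(aligned)                                          # (B, Q, teacher_dim)
```

A distillation loss would then compare the projector's output against teacher features of matching shape; the paper applies this alignment within its dual-view feature mimicking scheme.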
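The abstract also names dual-view feature mimicking. The sketch below shows one generic way to combine a direct feature-matching term with a second term computed on a transformed view of the same features; the choice of views, the mean-removal transform, and the loss weights are placeholders, not the paper's definition.

```python
import torch.nn.functional as F


def dual_view_mimicking_loss(student_feat, teacher_feat, alpha=1.0, beta=1.0):
    """Sketch of a two-view feature-mimicking loss.

    student_feat / teacher_feat: (B, N, D) tensors already aligned in shape,
    e.g. by a projector such as the one sketched above. The two views used
    here (raw features and mean-removed features) are illustrative stand-ins.
    """
    # View 1: plain feature mimicking on the aligned features.
    loss_direct = F.mse_loss(student_feat, teacher_feat)

    # View 2: mimic the residual after removing each sample's mean token,
    # so the second term is not dominated by the strongest component.
    s_res = student_feat - student_feat.mean(dim=1, keepdim=True)
    t_res = teacher_feat - teacher_feat.mean(dim=1, keepdim=True)
    loss_residual = F.mse_loss(s_res, t_res)

    return alpha * loss_direct + beta * loss_residual
```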
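The Experiment Setup row lists the hyper-parameter categories reported in Tables 10a and 10b but not their values. The skeleton below only mirrors that structure so the two strategies are easy to compare; every value is a placeholder and must be filled in from the paper's tables, not taken from this sketch.

```python
# Placeholder skeleton for the two training strategies; all values are
# deliberately left as None, NOT the paper's settings.
traditional_strategy = {
    "batch_size": None,
    "learning_rate": None,
    "lr_schedule": None,          # see Table 10a
    "optimizer": None,            # see Table 10a
    "optimizer_hparams": None,
    "weight_decay": None,
    "training_epochs": None,
    "warmup_epochs": None,
    "random_flip": None,
    "random_resize_crop": None,
}

advanced_strategy = {
    **traditional_strategy,       # see Table 10b for the actual values
    "drop_path": None,
    "label_smoothing": None,
    "random_augmentation": None,
    "random_erasing": None,
}
```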