Self-supervised Models are Good Teaching Assistants for Vision Transformers
Authors: Haiyan Wu, Yuting Gao, Yinqi Zhang, Shaohui Lin, Yuan Xie, Xing Sun, Ke Li
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments verify the effectiveness of SSTA and demonstrate that the proposed SSTA is a good compensation to the supervised teacher. [...] Extensive experiments are conducted to demonstrate the advantage of the self-supervised teaching assistant. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Technology, East China Normal University, Shanghai, China 2Tencent Youtu Lab, Shanghai, China. |
| Pseudocode | No | The paper describes the methodology using text and mathematical equations for attention computation (Eq 1 and 2), but does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is released at https://github.com/GlassyWu/SSTA |
| Open Datasets | Yes | ImageNet (Russakovsky et al., 2015) is used to verify the effectiveness of our method. CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) are adopted for downstream transferring tasks. ImageNet-C (Hendrycks & Dietterich, 2019) is utilized to analyze the robustness of the representations. The SIN dataset (Geirhos et al., 2018) is used to evaluate the shape bias of models. |
| Dataset Splits | Yes | The abscissa is the top 10 categories in the validation dataset of ImageNet predicted by the SL teacher and SSL teacher, and the ordinate is the specific number. [...] The total number of distillation epochs is 300 and 400 for DeiT and XCiT respectively, and the corresponding early-stop epochs are 100 and 150. |
| Hardware Specification | No | The paper describes various experimental settings, including datasets and training parameters, but does not specify the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation, such as Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | Following DeiT and XCiT, the total number of distillation epochs is 300 and 400 for DeiT and XCiT respectively, and the corresponding early-stop epochs are 100 and 150. [...] The total loss is defined as follows: $\mathcal{L}_{Total} = \alpha \mathcal{L}_{CE}(f^{S}(X), y) + \beta \mathcal{L}_{KD}^{SL} + \lambda \mathcal{L}_{KD}^{SSL}$ (7), where $\mathcal{L}_{CE}(\cdot)$ denotes cross entropy and $y$ is the ground truth. $\alpha$, $\beta$ and $\lambda$ are the hyper-parameters that control the weights of the CE loss and the distillation losses. |
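
A minimal sketch of how the total objective in Eq. (7) could be assembled, assuming the two distillation terms (from the supervised teacher and the self-supervised teaching assistant) have already been computed; the function name, signature, and default weights below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def total_loss(student_logits: torch.Tensor,
               labels: torch.Tensor,
               sl_kd_loss: torch.Tensor,
               ssl_kd_loss: torch.Tensor,
               alpha: float = 1.0,
               beta: float = 1.0,
               lam: float = 1.0) -> torch.Tensor:
    """Combine cross entropy with the SL and SSL distillation terms (Eq. 7).

    sl_kd_loss / ssl_kd_loss are assumed to be precomputed scalar losses
    from the supervised teacher and the self-supervised teaching assistant;
    alpha, beta, and lam correspond to the weights in the paper.
    """
    ce = F.cross_entropy(student_logits, labels)  # L_CE(f^S(X), y)
    return alpha * ce + beta * sl_kd_loss + lam * ssl_kd_loss
```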