Self-supervised Models are Good Teaching Assistants for Vision Transformers

Authors: Haiyan Wu, Yuting Gao, Yinqi Zhang, Shaohui Lin, Yuan Xie, Xing Sun, Ke Li

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments verify the effectiveness of SSTA and demonstrate that the proposed SSTA is a good compensation to the supervised teacher. [...] Extensive experiments are conducted to demonstrate the advantage of the self-supervised teaching assistant.
Researcher Affiliation | Collaboration | (1) School of Computer Science and Technology, East China Normal University, Shanghai, China; (2) Tencent Youtu Lab, Shanghai, China.
Pseudocode | No | The paper describes the methodology using text and mathematical equations for attention computation (Eqs. 1 and 2), but does not provide any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is released in https://github.com/GlassyWu/SSTA
Open Datasets | Yes | ImageNet (Russakovsky et al., 2015) is used to verify the effectiveness of our method. CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) are adopted for downstream transferring tasks. ImageNet-C (Hendrycks & Dietterich, 2019) is utilized to analyze the robustness of the representations. The SIN dataset (Geirhos et al., 2018) is used to evaluate the shape bias of models.
Dataset Splits | Yes | The abscissa is the top 10 categories in the validation dataset of ImageNet predicted by the SL teacher and the SSL teacher, and the ordinate is the specific number. [...] The total numbers of distillation epochs are 300 and 400 for DeiT and XCiT respectively, and the corresponding early-stop epochs are 100 and 150.
Hardware Specification | No | The paper describes various experimental settings, including datasets and training parameters, but does not specify the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation, such as Python, PyTorch, or CUDA versions.
Experiment Setup | Yes | Following DeiT and XCiT, the total numbers of distillation epochs are 300 and 400 for DeiT and XCiT respectively, and the corresponding early-stop epochs are 100 and 150. [...] The total loss is defined as follows: $\mathcal{L}_{Total} = \alpha \mathcal{L}_{CE}(f_S(X), y) + \beta \mathcal{L}_{KD}^{SL} + \lambda \mathcal{L}_{KD}^{SSL}$ (Eq. 7), where $\mathcal{L}_{CE}(\cdot)$ denotes cross entropy and y is the ground truth. α, β and λ are the hyper-parameters that control the weights of the CE loss and the distillation losses.
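
To make Eq. 7 concrete, below is a minimal PyTorch sketch of how the three loss terms could be combined. The function name `ssta_total_loss`, the logit-level KL-divergence form of the two distillation terms, and the shared temperature are illustrative assumptions; the paper's actual $\mathcal{L}_{KD}^{SL}$ and $\mathcal{L}_{KD}^{SSL}$ terms may be defined over other signals (e.g., the attention maps referenced in Eqs. 1 and 2), so this is a sketch of the weighting scheme rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F


def ssta_total_loss(student_logits, sl_teacher_logits, ssl_teacher_logits, targets,
                    alpha=1.0, beta=1.0, lam=1.0, temperature=1.0):
    """Weighted sum of CE and two distillation terms, mirroring Eq. 7.

    The KL-divergence-on-logits form of the distillation terms is an
    assumption; the paper may distill different signals (e.g., attention maps).
    """
    # Supervised cross-entropy against ground-truth labels: L_CE(f_S(X), y)
    ce = F.cross_entropy(student_logits, targets)

    def soft_kd(teacher_logits):
        # Standard soft distillation: KL between temperature-scaled distributions
        return F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2

    kd_sl = soft_kd(sl_teacher_logits)    # from the supervised (SL) teacher
    kd_ssl = soft_kd(ssl_teacher_logits)  # from the self-supervised (SSL) teaching assistant

    return alpha * ce + beta * kd_sl + lam * kd_ssl


# Example usage with random stand-ins for real model outputs:
logits_s = torch.randn(8, 1000)
loss = ssta_total_loss(logits_s, torch.randn(8, 1000), torch.randn(8, 1000),
                       torch.randint(0, 1000, (8,)))
```

The weights α, β and λ simply rescale the three terms before summation; the report excerpt above does not state the values used, so the defaults of 1.0 here are placeholders.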