Self-Distillation for Further Pre-training of Transformers

Authors: Seanie Lee, Minki Kang, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate the superiority of self-distillation over relevant baselines on various benchmark datasets for image and text classification tasks.
Researcher Affiliation | Collaboration | KAIST, AITRICS, National University of Singapore; {lsnfamily02, zzxc1133, juholee, sjhwang82}@kaist.ac.kr, kenji@comp.nus.edu.sg
Pseudocode | Yes | Algorithm 1 (Self-Distillation); Algorithm 2 (Further Pretrain). A hedged code sketch of the two algorithms appears after this table.
Open Source Code | No | The paper states 'We use Pytorch (Paszke et al., 2019) and transformers library (Wolf et al., 2020) from Huggingface to implement all the baselines and our proposed method in the experiments.' but does not provide a link to their own implementation or explicitly state that their code is open-source.
Open Datasets | Yes | For image classification problem, we use six datasets FGVC Aircraft (Aircraft) (Maji et al., 2013), Caltech UCSD Birds 200 (CUB) (Wah et al., 2011), Chest X-ray (Kermany et al., 2018), Describable Textures Dataset (DTD) (Cimpoi et al., 2014), Stanford Dogs (Khosla et al., 2011), and Oxford 102 Flower (Nilsback & Zisserman, 2008). For text classification problem, we use four datasets Chemprot (Kringelum et al., 2016), ACL-ARC (Jurgens et al., 2018), SCIERC (Luan et al., 2018), and Twitter-Emotion (Mohammad et al., 2018).
Dataset Splits | No | The paper mentions training and test sets but does not specify clear training/validation/test splits (e.g., percentages or exact counts) for the datasets used in the main experiments, beyond mentioning '50,000 training pairs' for CIFAR-100 without explicit split details.
Hardware Specification | Yes | We train a Vision Transformer (Dosovitskiy et al., 2021) on CUB dataset with 3090 RTX GPU and Intel(R) Xeon(R) Silver 4210R CPU.
Software Dependencies | No | The paper states 'We use Pytorch (Paszke et al., 2019) and transformers library (Wolf et al., 2020) from Huggingface' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | For the image classification problem, we use Vision Transformer... fine-tune it on the downstream task with AdamW optimizer... for 10,000 steps with batch size 32. Regarding further pre-training and self-distillation, we continue to pre-train the model for 20,000 steps with batch size 64. For text classification... fine-tune it on the target labeled dataset with AdamW optimizer for 10 epochs with batch size 32. In terms of further pre-training and self-distillation, we further pre-train RoBERTa for 100 epochs with batch size 128. Appendix E (Table 9) further specifies hyperparameters like learning rate, weight decay coefficient, and rounds of self-distillation. (A hedged fine-tuning sketch appears after this table.)
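
To make the two algorithms named in the Pseudocode row concrete, here is a minimal sketch of further pre-training followed by self-distillation, written for the text (masked language modeling) case with PyTorch and the Hugging Face transformers library that the paper reports using. The function names (`further_pretrain`, `self_distill`), the MSE loss between the last hidden states of student and teacher, and the default hyperparameters are illustrative assumptions, not the authors' implementation; Algorithms 1-2 and Appendix E (Table 9) of the paper give the exact procedure and settings.

```python
# Sketch only: the distillation loss and hyperparameters below are assumptions.
import copy
from itertools import cycle, islice

import torch
import torch.nn.functional as F


def further_pretrain(model, unlabeled_loader, steps, lr=1e-4):
    """Algorithm 2 (sketch): continue masked-language-model training on
    target-domain text. Batches are assumed to already carry MLM labels,
    e.g. from transformers.DataCollatorForLanguageModeling."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    # cycle() simply repeats the loader to fill a fixed step budget.
    for batch in islice(cycle(unlabeled_loader), steps):
        loss = model(**batch).loss  # MLM loss from the language-modeling head
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


def self_distill(pretrained_init, teacher, unlabeled_loader, steps, lr=1e-4, alpha=1.0):
    """Algorithm 1 (sketch): re-initialize a student from the original
    pre-trained weights and train it to match the further pre-trained
    teacher while keeping the MLM objective. Matching the last hidden
    states with an MSE loss is an assumption made for illustration."""
    student = copy.deepcopy(pretrained_init)
    teacher.eval()
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    student.train()
    for batch in islice(cycle(unlabeled_loader), steps):
        s_out = student(**batch, output_hidden_states=True)
        with torch.no_grad():
            t_out = teacher(**batch, output_hidden_states=True)
        distill_loss = F.mse_loss(s_out.hidden_states[-1], t_out.hidden_states[-1])
        loss = s_out.loss + alpha * distill_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```

One round chains the two stages: the further pre-trained model serves as the teacher, and the distilled student, re-initialized from the original pre-trained weights, is the model that gets fine-tuned; the number of self-distillation rounds is among the hyperparameters listed in Appendix E (Table 9).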
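
The fine-tuning stage from the Experiment Setup row can be sketched in the same spirit: AdamW for 10,000 steps at batch size 32 for the Vision Transformer (the text side follows the same pattern, fine-tuning RoBERTa for 10 epochs at batch size 32). The checkpoint name, learning rate, and weight decay below are placeholders; the values actually used are given in Appendix E (Table 9).

```python
# Sketch only: checkpoint name, learning rate, and weight decay are placeholders.
from itertools import cycle, islice

import torch
from transformers import ViTForImageClassification


def finetune(model, train_loader, steps=10_000, lr=1e-4, weight_decay=0.01):
    """Fine-tune a (further pre-trained or self-distilled) backbone on the
    labeled downstream task with AdamW, as in the Experiment Setup row."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    model.train()
    for batch in islice(cycle(train_loader), steps):
        out = model(pixel_values=batch["pixel_values"], labels=batch["labels"])
        opt.zero_grad()
        out.loss.backward()
        opt.step()
    return model


# Example head for CUB (200 bird classes); train_loader is assumed to yield
# dicts with "pixel_values" and "labels" at batch size 32, as quoted above.
vit = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=200
)
```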