Self-Distillation for Further Pre-training of Transformers
Authors: Seanie Lee, Minki Kang, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate the superiority of self-distillation over relevant baselines on various benchmark datasets for image and text classification tasks. |
| Researcher Affiliation | Collaboration | KAIST, AITRICS, National University of Singapore; {lsnfamily02, zzxc1133, juholee, sjhwang82}@kaist.ac.kr; kenji@comp.nus.edu.sg |
| Pseudocode | Yes | Algorithm 1 Self-Distillation; Algorithm 2 Further Pretrain (a hedged sketch of these algorithms follows the table) |
| Open Source Code | No | The paper states 'We use Pytorch (Paszke et al., 2019) and transformers library (Wolf et al., 2020) from Huggingface to implement all the baselines and our proposed method in the experiments.' but does not provide a link to their own implementation or explicitly state that their code is open-source. |
| Open Datasets | Yes | For image classification problem, we use six datasets FGVC Aircraft (Aircraft) (Maji et al., 2013), Caltech UCSD Birds 200 (CUB) (Wah et al., 2011), Chest X-ray (Kermany et al., 2018), Describable Textures Dataset (DTD) (Cimpoi et al., 2014), Stanford Dogs (Khosla et al., 2011), and Oxford 102 Flower (Nilsback & Zisserman, 2008). For text classification problem, we use four datasets Chemprot (Kringelum et al., 2016), ACL-ARC (Jurgens et al., 2018), SCIERC (Luan et al., 2018), and Twitter-Emotion (Mohammad et al., 2018). |
| Dataset Splits | No | The paper mentions training and test sets but does not specify explicit training/validation/test splits (e.g., percentages or exact counts) for the datasets used in the main experiments; the only count given is '50,000 training pairs' for CIFAR-100, without further split details. |
| Hardware Specification | Yes | We train a Vision Transformer (Dosovitskiy et al., 2021) on CUB dataset with 3090 RTX GPU and Intel(R) Xeon(R) Silver 4210R CPU. |
| Software Dependencies | No | The paper states 'We use Pytorch (Paszke et al., 2019) and transformers library (Wolf et al., 2020) from Huggingface' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For the image classification problem, we use Vision Transformer... fine-tune it on the downstream task with AdamW optimizer... for 10,000 steps with batch size 32. Regarding further pre-training and self-distillation, we continue to pre-train the model for 20,000 steps with batch size 64. For text classification... fine-tune it on the target labeled dataset with AdamW optimizer for 10 epochs with batch size 32. In terms of further pre-training and self-distillation, we further pre-train RoBERTa for 100 epochs with batch size 128. Appendix E (Table 9) further specifies hyperparameters like learning rate, weight decay coefficient, and rounds of self-distillation. (A hedged sketch of this fine-tuning setup also follows the table.) |
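
Since the paper provides Algorithm 1 (Self-Distillation) and Algorithm 2 (Further Pretrain) only as pseudocode and does not release code, the sketch below shows one way the procedure could look in PyTorch with the transformers library. It is a minimal approximation under stated assumptions, not the authors' implementation: the function names (`mlm_step`, `further_pretrain`, `self_distill_round`), the choice of final-layer hidden states as the distillation target, and the MSE distillation term weighted by `alpha` are illustrative assumptions; the exact objective, masking scheme, and schedules are given in the paper's algorithms and Appendix E.

```python
# A minimal sketch of one self-distillation round for further pre-training (text setting).
# Assumptions, not taken from the paper's (unreleased) code: the distillation target is the
# final-layer hidden states, the distillation term is an MSE loss added to the MLM loss with
# weight `alpha`, and masking/optimization details are simplified. `loader` is assumed to
# yield MLM-masked batches (e.g. built with transformers' DataCollatorForLanguageModeling).
import itertools

import torch
from torch.optim import AdamW
from transformers import AutoModelForMaskedLM


def mlm_step(model, batch):
    """Return the masked-LM loss and the final-layer hidden states for one batch."""
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],
        output_hidden_states=True,
    )
    return out.loss, out.hidden_states[-1]


def further_pretrain(model, loader, steps, lr=1e-4):
    """Rough analogue of Algorithm 2 (Further Pretrain): continue MLM on target-domain text."""
    opt = AdamW(model.parameters(), lr=lr)
    for _, batch in zip(range(steps), itertools.cycle(loader)):
        loss, _ = mlm_step(model, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


def self_distill_round(pretrained_name, loader, steps, alpha=1.0, lr=1e-4):
    """Rough analogue of one round of Algorithm 1 (Self-Distillation)."""
    # Teacher: a copy of the original checkpoint, further pre-trained on the target corpus.
    teacher = further_pretrain(
        AutoModelForMaskedLM.from_pretrained(pretrained_name), loader, steps, lr
    )
    teacher.eval()
    # Student: re-initialized from the original pre-trained weights.
    student = AutoModelForMaskedLM.from_pretrained(pretrained_name)
    opt = AdamW(student.parameters(), lr=lr)
    for _, batch in zip(range(steps), itertools.cycle(loader)):
        mlm_loss, s_hidden = mlm_step(student, batch)
        with torch.no_grad():
            _, t_hidden = mlm_step(teacher, batch)
        distill_loss = torch.nn.functional.mse_loss(s_hidden, t_hidden)
        loss = mlm_loss + alpha * distill_loss  # `alpha` is an assumed weighting
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student  # the student is what gets fine-tuned on the labeled downstream task
```

For multiple rounds (the "rounds of self-distillation" hyperparameter noted for Appendix E, Table 9), the procedure would be repeated; how successive rounds chain teacher and student checkpoints should be read from Algorithm 1 itself rather than from this sketch.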
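Similarly, the text-classification part of the Experiment Setup row amounts to a short fine-tuning loop. The sketch below is illustrative only: the AdamW optimizer, 10 epochs, and batch size 32 come from the quoted setup, while the learning rate, weight decay, and data handling are placeholders (the paper's values are in Appendix E, Table 9), and in the paper's pipeline the checkpoint being fine-tuned would be the self-distilled student rather than the vanilla roberta-base weights.

```python
# A minimal sketch of the quoted text-classification fine-tuning setup: AdamW, 10 epochs,
# batch size 32. Learning rate and weight decay are placeholders, not the paper's values
# (those are in Appendix E, Table 9). `train_dataset` is assumed to yield dicts of
# fixed-length tokenized tensors including a "labels" entry.
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification


def finetune(train_dataset, num_labels, lr=2e-5, weight_decay=0.01):
    # In the paper's pipeline this would load the self-distilled student checkpoint
    # rather than the vanilla pre-trained weights.
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=num_labels
    )
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    opt = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    model.train()
    for _ in range(10):  # 10 epochs, as quoted above
        for batch in loader:
            out = model(**batch)  # cross-entropy loss computed from the "labels" entry
            opt.zero_grad()
            out.loss.backward()
            opt.step()
    return model
```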