HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers

Authors: Chen Liang, Haoming Jiang, Zheng Li, Xianfeng Tang, Bing Yin, Tuo Zhao

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that HomoDistil achieves significant improvements on existing baselines.
Researcher Affiliation | Collaboration | Georgia Institute of Technology; Amazon. {cliang73,tourzhao}@gatech.edu, {jhaoming,amzzhe,xianft,alexbyin}@amazon.com
Pseudocode | Yes | The complete algorithm is shown in Alg. 1 ("Algorithm 1: HomoDistil: Homotopic Distillation"). A hedged code sketch of such a distill-and-prune loop is given below the table.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a repository for HomoDistil.
Open Datasets | Yes | We distill the student using the open-domain corpus for BERT pre-training (Devlin et al., 2018), i.e., Wikipedia, an English Wikipedia corpus containing 2500M words, and Toronto Book Corpus (Zhu et al., 2015), containing 800M words. (A hedged data-loading sketch also follows the table.)
Dataset Splits | Yes | Table 16: Summary of the GLUE benchmark (Corpus, Task, #Train, #Dev, #Test, #Label, Metrics) ... e.g., MNLI: NLI, 393k train, 20k dev, 20k test, 3 labels, Accuracy.
Hardware Specification | Yes | The continual pre-training experiment runs for around 13 hours on 8 Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions using PyTorch's profiler package and the Adam optimizer, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For all experiments, we use a max sequence length of 128 and a batch size of 4k. We train the student model for T = 28k steps (3 epochs). We use Adam (Kingma & Ba, 2014) as the optimizer with (β1, β2) = (0.9, 0.999), ε = 1×10⁻⁶. We use a learning rate of 3×10⁻⁴ for HomoBERT-base and 6×10⁻⁴ for HomoBERT-small/xsmall/tiny. (These settings are restated as a configuration sketch below the table.)
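
The Pseudocode row confirms that the paper gives an "Algorithm 1: HomoDistil: Homotopic Distillation", but the table does not reproduce it. Below is a minimal PyTorch sketch of an iterative distill-and-prune loop in the spirit of the paper's title: the student is initialized as a copy of the teacher, and a small fraction of weights is pruned at each step while a distillation loss keeps the student close to the teacher. The prediction-level KL loss, the |weight × gradient| importance score, the linear sparsity schedule, and all function names are illustrative assumptions, not the paper's actual Algorithm 1, whose full objective and pruning granularity may differ.

```python
import copy

import torch
import torch.nn.functional as F


def homotopic_distill(teacher, loader, total_steps, target_sparsity, lr=3e-4):
    """Distill `teacher` into a gradually pruned copy of itself.

    `loader` is assumed to yield at least `total_steps` dicts of tensors
    accepted by the model (e.g., Hugging Face-style `input_ids` batches).
    """
    teacher.eval()
    # The student starts as an exact copy of the teacher, so the initial
    # teacher-student discrepancy is zero (the "homotopy" start point).
    student = copy.deepcopy(teacher)
    optimizer = torch.optim.Adam(student.parameters(), lr=lr,
                                 betas=(0.9, 0.999), eps=1e-6)

    # Persistent binary masks over the weight matrices we allow to be pruned.
    masks = {n: torch.ones_like(p)
             for n, p in student.named_parameters() if p.dim() > 1}
    num_prunable = sum(m.numel() for m in masks.values())

    for step, batch in zip(range(total_steps), loader):
        with torch.no_grad():
            t_logits = teacher(**batch).logits
        s_logits = student(**batch).logits

        # Prediction-level distillation loss (KL between soft targets).
        loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                        F.softmax(t_logits, dim=-1),
                        reduction="batchmean")

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Linearly ramp the sparsity so each step prunes only a little and the
        # student never jumps far away from the teacher.
        sparsity = target_sparsity * min(1.0, step / (0.5 * total_steps))
        k = int(sparsity * num_prunable)
        if k > 0:
            with torch.no_grad():
                # Assumed importance score: |weight * gradient| of the
                # distillation loss; already-pruned weights keep a score of
                # zero and therefore stay pruned.
                scores = {n: (p * p.grad).abs() * masks[n]
                          for n, p in student.named_parameters()
                          if n in masks and p.grad is not None}
                threshold = torch.kthvalue(
                    torch.cat([s.flatten() for s in scores.values()]), k).values
                masks.update({n: (scores[n] > threshold).float() for n in scores})

        # Re-apply the masks after the optimizer update.
        with torch.no_grad():
            for n, p in student.named_parameters():
                if n in masks:
                    p.mul_(masks[n])

    return student
```

With the schedule quoted in the Experiment Setup row, this loop would run with total_steps=28_000; the sketch only conveys the overall structure of alternating distillation updates with gradual pruning from a teacher-initialized student.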
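The Open Datasets row quotes English Wikipedia and the Toronto Book Corpus as the distillation corpus. For convenience, here is a hedged sketch of loading both with the Hugging Face `datasets` library; the paper does not name a loading toolkit, and the dataset identifiers and dump date below are assumptions that depend on the installed library version.

```python
from datasets import load_dataset

# English Wikipedia; the paper cites the corpus but not a dump date, so the
# "20220301.en" configuration is an assumption (newer library versions host
# this data under "wikimedia/wikipedia" instead).
wikipedia = load_dataset("wikipedia", "20220301.en", split="train")

# Toronto Book Corpus (Zhu et al., 2015).
bookcorpus = load_dataset("bookcorpus", split="train")

pretraining_corpus = [wikipedia, bookcorpus]
```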
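The Experiment Setup row fixes the main hyperparameters. Restated as a small PyTorch configuration sketch; only the numbers come from the quoted text, while the dictionary keys and helper function are illustrative:

```python
import torch

MAX_SEQ_LENGTH = 128      # "max sequence length of 128"
BATCH_SIZE = 4096         # "batch size of 4k"
TOTAL_STEPS = 28_000      # "T = 28k steps (3 epochs)"

# Per-student learning rates quoted in the paper.
LEARNING_RATE = {
    "HomoBERT-base": 3e-4,
    "HomoBERT-small": 6e-4,
    "HomoBERT-xsmall": 6e-4,
    "HomoBERT-tiny": 6e-4,
}


def build_optimizer(model: torch.nn.Module, student_name: str) -> torch.optim.Adam:
    """Adam with the betas and epsilon quoted in the setup row."""
    return torch.optim.Adam(model.parameters(),
                            lr=LEARNING_RATE[student_name],
                            betas=(0.9, 0.999),
                            eps=1e-6)
```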