HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers
Authors: Chen Liang, Haoming Jiang, Zheng Li, Xianfeng Tang, Bing Yin, Tuo Zhao
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that HomoDistil achieves significant improvements on existing baselines. |
| Researcher Affiliation | Collaboration | Georgia Institute of Technology, Amazon {cliang73,tourzhao}@gatech.edu, {jhaoming,amzzhe,xianft,alexbyin}@amazon.com |
| Pseudocode | Yes | The complete algorithm is shown in Alg. 1 ("Algorithm 1: HomoDistil: Homotopic Distillation"). |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a repository for Homo Distil. |
| Open Datasets | Yes | We distill the student using the open-domain corpus for BERT pre-training (Devlin et al., 2018), i.e., Wikipedia, an English Wikipedia corpus containing 2,500M words, and Toronto Book Corpus (Zhu et al., 2015), containing 800M words. |
| Dataset Splits | Yes | Table 16 summarizes the GLUE benchmark (Corpus, Task, #Train, #Dev, #Test, #Labels, Metrics); e.g., MNLI (NLI): 393k train, 20k dev, 20k test, 3 labels, Accuracy. |
| Hardware Specification | Yes | The continual pre-training experiment runs for around 13 hours on 8 Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch's profiler package and the Adam optimizer, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For all experiments, we use a max sequence length of 128 and a batch size of 4k. We train the student model for T = 28k steps (3 epochs). We use Adam (Kingma & Ba, 2014) as the optimizer with (β₁, β₂) = (0.9, 0.999) and ε = 1e-6. We use a learning rate of 3e-4 for HomoBERT-base and 6e-4 for HomoBERT-small/xsmall/tiny. |
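
The hyperparameters quoted in the "Experiment Setup" row translate directly into an optimizer and training-step configuration. Below is a minimal sketch assuming a PyTorch setup; the model objects, the plain KL-divergence distillation loss, and the `distillation_step` helper are illustrative assumptions and do not reproduce the paper's Algorithm 1, which additionally initializes the student from the teacher and prunes it iteratively during distillation.

```python
# Minimal sketch of the reported distillation pre-training configuration.
# Only the numeric hyperparameters come from the table above; the training
# loop, loss, and model interfaces are assumptions for illustration.
import torch
import torch.nn.functional as F
from torch.optim import Adam

# Hyperparameters quoted in the "Experiment Setup" row.
MAX_SEQ_LEN = 128
BATCH_SIZE = 4096          # "batch size of 4k" sequences per step
TOTAL_STEPS = 28_000       # T = 28k steps (~3 epochs over Wikipedia + BookCorpus)
LR_BASE = 3e-4             # HomoBERT-base
LR_SMALL = 6e-4            # HomoBERT-small / xsmall / tiny
ADAM_BETAS = (0.9, 0.999)
ADAM_EPS = 1e-6


def build_optimizer(student: torch.nn.Module, lr: float = LR_BASE) -> Adam:
    """Adam optimizer with the betas and epsilon reported in the paper."""
    return Adam(student.parameters(), lr=lr, betas=ADAM_BETAS, eps=ADAM_EPS)


def distillation_step(student, teacher, batch, optimizer, temperature: float = 1.0):
    """One generic knowledge-distillation step (not Algorithm 1 itself):
    match the student's output distribution to the frozen teacher's.

    `student` and `teacher` are assumed to be callables that map a batch of
    token ids to logits of shape (batch, seq_len, vocab_size).
    """
    with torch.no_grad():
        teacher_logits = teacher(**batch)
    student_logits = student(**batch)

    # KL divergence between temperature-scaled output distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For the smaller students, `build_optimizer(student, lr=LR_SMALL)` would be used instead, matching the 6e-4 learning rate quoted for HomoBERT-small/xsmall/tiny; everything beyond the quoted hyperparameter values is an assumption about how such a setup is typically wired together.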