HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers

Authors: Chen Liang, Haoming Jiang, Zheng Li, Xianfeng Tang, Bing Yin, Tuo Zhao

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that HomoDistil achieves significant improvements on existing baselines.
Researcher Affiliation | Collaboration | Georgia Institute of Technology; Amazon. {cliang73,tourzhao}@gatech.edu, {jhaoming,amzzhe,xianft,alexbyin}@amazon.com
Pseudocode | Yes | The complete algorithm is shown in Alg. 1 ("Algorithm 1: HomoDistil: Homotopic Distillation"). A hedged code sketch of such a distill-and-prune loop is given below the table.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a repository for HomoDistil.
Open Datasets | Yes | We distill the student using the open-domain corpus for BERT pre-training (Devlin et al., 2018), i.e., Wikipedia, an English Wikipedia corpus containing 2500M words, and Toronto Book Corpus (Zhu et al., 2015), containing 800M words. (A hedged data-loading sketch also follows the table.)
Dataset Splits | Yes | Table 16: Summary of the GLUE benchmark (Corpus, Task, #Train, #Dev, #Test, #Label, Metrics) ... e.g., MNLI: NLI, 393k train, 20k dev, 20k test, 3 labels, Accuracy.
Hardware Specification | Yes | The continual pre-training experiment runs for around 13 hours on 8 Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions using PyTorch's profiler package and the Adam optimizer, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For all experiments, we use a max sequence length of 128 and a batch size of 4k. We train the student model for T = 28k steps (3 epochs). We use Adam (Kingma & Ba, 2014) as the optimizer with (β1, β2) = (0.9, 0.999), ε = 1×10⁻⁶. We use a learning rate of 3×10⁻⁴ for HomoBERT-base and 6×10⁻⁴ for HomoBERT-small/xsmall/tiny. (These settings are restated as a configuration sketch below the table.)
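
The Pseudocode row confirms that the paper gives an "Algorithm 1: HomoDistil: Homotopic Distillation", but the table does not reproduce it. Below is a minimal PyTorch sketch of an iterative distill-and-prune loop in the spirit of the paper's title: the student is initialized as a copy of the teacher, and a small fraction of weights is pruned at each step while a distillation loss keeps the student close to the teacher. The prediction-level KL loss, the |weight × gradient| importance score, the linear sparsity schedule, and all function names are illustrative assumptions, not the paper's actual Algorithm 1, whose full objective and pruning granularity may differ.

```python
import copy

import torch
import torch.nn.functional as F


def homotopic_distill(teacher, loader, total_steps, target_sparsity, lr=3e-4):
    """Distill `teacher` into a gradually pruned copy of itself.

    `loader` is assumed to yield at least `total_steps` dicts of tensors
    accepted by the model (e.g., Hugging Face-style `input_ids` batches).
    """
    teacher.eval()
    # The student starts as an exact copy of the teacher, so the initial
    # teacher-student discrepancy is zero (the "homotopy" start point).
    student = copy.deepcopy(teacher)
    optimizer = torch.optim.Adam(student.parameters(), lr=lr,
                                 betas=(0.9, 0.999), eps=1e-6)

    # Persistent binary masks over the weight matrices we allow to be pruned.
    masks = {n: torch.ones_like(p)
             for n, p in student.named_parameters() if p.dim() > 1}
    num_prunable = sum(m.numel() for m in masks.values())

    for step, batch in zip(range(total_steps), loader):
        with torch.no_grad():
            t_logits = teacher(**batch).logits
        s_logits = student(**batch).logits

        # Prediction-level distillation loss (KL between soft targets).
        loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                        F.softmax(t_logits, dim=-1),
                        reduction="batchmean")

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Linearly ramp the sparsity so each step prunes only a little and the
        # student never jumps far away from the teacher.
        sparsity = target_sparsity * min(1.0, step / (0.5 * total_steps))
        k = int(sparsity * num_prunable)
        if k > 0:
            with torch.no_grad():
                # Assumed importance score: |weight * gradient| of the
                # distillation loss; already-pruned weights keep a score of
                # zero and therefore stay pruned.
                scores = {n: (p * p.grad).abs() * masks[n]
                          for n, p in student.named_parameters()
                          if n in masks and p.grad is not None}
                threshold = torch.kthvalue(
                    torch.cat([s.flatten() for s in scores.values()]), k).values
                masks.update({n: (scores[n] > threshold).float() for n in scores})

        # Re-apply the masks after the optimizer update.
        with torch.no_grad():
            for n, p in student.named_parameters():
                if n in masks:
                    p.mul_(masks[n])

    return student
```

With the schedule quoted in the Experiment Setup row, this loop would run with total_steps=28_000; the sketch only conveys the overall structure of alternating distillation updates with gradual pruning from a teacher-initialized student.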
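The Open Datasets row quotes English Wikipedia and the Toronto Book Corpus as the distillation corpus. For convenience, here is a hedged sketch of loading both with the Hugging Face `datasets` library; the paper does not name a loading toolkit, and the dataset identifiers and dump date below are assumptions that depend on the installed library version.

```python
from datasets import load_dataset

# English Wikipedia; the paper cites the corpus but not a dump date, so the
# "20220301.en" configuration is an assumption (newer library versions host
# this data under "wikimedia/wikipedia" instead).
wikipedia = load_dataset("wikipedia", "20220301.en", split="train")

# Toronto Book Corpus (Zhu et al., 2015).
bookcorpus = load_dataset("bookcorpus", split="train")

pretraining_corpus = [wikipedia, bookcorpus]
```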
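The Experiment Setup row fixes the main hyperparameters. Restated as a small PyTorch configuration sketch; only the numbers come from the quoted text, while the dictionary keys and helper function are illustrative:

```python
import torch

MAX_SEQ_LENGTH = 128      # "max sequence length of 128"
BATCH_SIZE = 4096         # "batch size of 4k"
TOTAL_STEPS = 28_000      # "T = 28k steps (3 epochs)"

# Per-student learning rates quoted in the paper.
LEARNING_RATE = {
    "HomoBERT-base": 3e-4,
    "HomoBERT-small": 6e-4,
    "HomoBERT-xsmall": 6e-4,
    "HomoBERT-tiny": 6e-4,
}


def build_optimizer(model: torch.nn.Module, student_name: str) -> torch.optim.Adam:
    """Adam with the betas and epsilon quoted in the setup row."""
    return torch.optim.Adam(model.parameters(),
                            lr=LEARNING_RATE[student_name],
                            betas=(0.9, 0.999),
                            eps=1e-6)
```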