Learn-to-Share: A Hardware-friendly Transfer Learning Framework Exploiting Computation and Parameter Sharing

Authors: Cheng Fu, Hanxian Huang, Xinyun Chen, Yuandong Tian, Jishen Zhao

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that with 1.4% of extra parameters per task, LeTS reduces the computation by 49.5% on GLUE benchmarks with only 0.2% accuracy loss compared to full fine-tuning.
Researcher Affiliation | Collaboration | 1 University of California, San Diego; 2 University of California, Berkeley; 3 Facebook AI Research.
Pseudocode | Yes | The detailed design flow is shown in Algorithm 1.
Open Source Code | No | The paper states 'Our pre-trained models and code base are from (Wolf et al., 2020)', which refers to a third-party library, but it does not state that the authors' own code for LeTS is open-sourced, nor does it provide a link.
Open Datasets | Yes | We evaluate LeTS on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), which consists of the following nine tasks: The Corpus of Linguistic Acceptability (CoLA). The Stanford Sentiment Treebank (SST-2). The Microsoft Research Paraphrase Corpus (MRPC). The Quora Question Pairs (QQP). The Semantic Textual Similarity Benchmark (STS-B). The Multi-Genre Natural Language Inference Corpus (MNLI; we test on both the matched domain MNLI-m and the mismatched domain MNLI-mm). The Stanford Question Answering Dataset (QNLI). The Recognizing Textual Entailment (RTE).
Dataset Splits | Yes | Table 2: Sensitivity study to sparsity ratio constraint and comparison to parameter-sharing baselines on the GLUE dev dataset.
Hardware Specification | Yes | The DNAS method takes 1 day on 4 V100 GPUs per task on average, which is less than 0.5% of the pre-training cost of BERT-Large (Devlin et al., 2019).
Software Dependencies | No | The paper mentions 'Pytorch Sparse', 'Tensorflow-Sparse', and 'Wolf et al., 2020' (HuggingFace Transformers) as software components, but does not provide specific version numbers for any of them.
Experiment Setup | Yes | We use N_steps = 100 to initialize W_δ (Sec. 3.1). Inspired by (Ravfogel & Goldberg), who observe that the bias terms require a larger learning rate to achieve better fine-tuning results, we apply two optimizers with different learning rate schedulers to update the bias terms (lr_b ∈ {1e-3, 5e-4}) and the other parts (lr_w ∈ {2e-5, 1e-5}) separately during the final fine-tuning. Details of other hyperparameters are shown in Appendix A.
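
The "Open Source Code" and "Open Datasets" rows above note that the paper builds on the HuggingFace code base (Wolf et al., 2020) and evaluates on GLUE. Below is a minimal sketch of how such a setup could be reproduced with the HuggingFace transformers and datasets libraries; the checkpoint name, task choice, and preprocessing are illustrative assumptions, since the paper does not release its own code or pin library versions.

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative task and backbone; any GLUE task key works here:
# cola, sst2, mrpc, qqp, stsb, mnli, qnli, rte
task = "cola"
dataset = load_dataset("glue", task)  # provides train/validation/test splits
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2
)

# The paper reports GLUE results on the dev (validation) split.
encoded_dev = dataset["validation"].map(
    lambda ex: tokenizer(ex["sentence"], truncation=True, padding="max_length"),
    batched=True,
)
print(encoded_dev)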
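The "Experiment Setup" row describes two optimizers with separate learning-rate schedulers: a larger lr_b for bias terms and a smaller lr_w for the remaining weights. The PyTorch sketch below illustrates one way to realize that split; the optimizer type, scheduler choice (linear warmup/decay), and step counts are assumptions, as the paper defers the remaining hyperparameters to its Appendix A.

import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizers(model, lr_b=1e-3, lr_w=2e-5, num_training_steps=10_000):
    # Split parameters into bias terms and everything else.
    bias_params = [p for n, p in model.named_parameters() if n.endswith("bias")]
    other_params = [p for n, p in model.named_parameters() if not n.endswith("bias")]

    # Two optimizers, each with its own learning rate and scheduler.
    opt_bias = torch.optim.AdamW(bias_params, lr=lr_b)
    opt_other = torch.optim.AdamW(other_params, lr=lr_w)
    sched_bias = get_linear_schedule_with_warmup(opt_bias, 0, num_training_steps)
    sched_other = get_linear_schedule_with_warmup(opt_other, 0, num_training_steps)
    return (opt_bias, sched_bias), (opt_other, sched_other)

# In the training loop, both optimizers and schedulers are stepped after each
# backward pass:
#   loss.backward()
#   opt_bias.step(); opt_other.step()
#   sched_bias.step(); sched_other.step()
#   opt_bias.zero_grad(); opt_other.zero_grad()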