Learn-to-Share: A Hardware-friendly Transfer Learning Framework Exploiting Computation and Parameter Sharing
Authors: Cheng Fu, Hanxian Huang, Xinyun Chen, Yuandong Tian, Jishen Zhao
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that with 1.4% of extra parameters per task, LeTS reduces the computation by 49.5% on GLUE benchmarks with only 0.2% accuracy loss compared to full fine-tuning. |
| Researcher Affiliation | Collaboration | University of California, San Diego; University of California, Berkeley; Facebook AI Research. |
| Pseudocode | Yes | The detailed design flow is shown in Algorithm 1. |
| Open Source Code | No | The paper states 'Our pre-trained models and code base are from (Wolf et al., 2020).', which refers to a third-party library used, but does not explicitly state that the authors' own code for LeTS is open-sourced or provide a link. |
| Open Datasets | Yes | We evaluate LeTS on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), which consists of the following nine tasks: The Corpus of Linguistic Acceptability (CoLA). The Stanford Sentiment Treebank (SST-2). The Microsoft Research Paraphrase Corpus (MRPC). The Quora Question Pairs (QQP). The Semantic Textual Similarity Benchmark (STS-B). The Multi-Genre Natural Language Inference Corpus (MNLI); we test on both the matched domain (MNLI-m) and the mismatched domain (MNLI-mm). The Stanford Question Answering Dataset (QNLI). The Recognizing Textual Entailment (RTE). (A hedged loading sketch for these tasks is given after the table.) |
| Dataset Splits | Yes | Table 2: Sensitivity study to sparsity ratio constraint and comparison to parameter-sharing baselines on GLUE dev dataset. |
| Hardware Specification | Yes | The DNAS method takes 1 day on 4 V100 GPUs per task on average which is less than 0.5% of the pre-training cost of BERTLARGE (Devlin et al., 2019). |
| Software Dependencies | No | The paper mentions 'Pytorch Sparse', 'Tensorflow-Sparse', and 'Wolf et al., 2020' (HuggingFace Transformers) as used software components, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We use N_steps = 100 to initialize Wδ (Sec. 3.1). Inspired by the observation in (Ravfogel & Goldberg) that the bias terms require a larger learning rate to achieve better fine-tuning results, we apply two optimizers with different learning rate schedulers to update the bias terms (lr_b ∈ {1e-3, 5e-4}) and the other parts (lr_w ∈ {2e-5, 1e-5}) separately during the final fine-tuning. Details of other hyperparameters are shown in Appendix A. (A hedged PyTorch sketch of this two-optimizer setup is given after the table.) |
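
As a hedged illustration of the Open Datasets row, the sketch below loads the listed GLUE tasks with the HuggingFace `datasets` library. The loading call and the task configuration names are standard to that library, not taken from the authors' code base (which the paper only describes as built on Wolf et al., 2020).

```python
# Hypothetical sketch: loading the GLUE tasks evaluated in the paper with the
# HuggingFace `datasets` library. This is illustrative only, not the authors' code.
from datasets import load_dataset

GLUE_TASKS = ["cola", "sst2", "mrpc", "qqp", "stsb", "mnli", "qnli", "rte"]

for task in GLUE_TASKS:
    dataset = load_dataset("glue", task)
    # MNLI exposes matched and mismatched validation splits (MNLI-m / MNLI-mm);
    # the other tasks have a single "validation" split.
    print(f"{task}: splits = {list(dataset.keys())}")
```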
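The Experiment Setup row describes two optimizers with different learning-rate schedules, one for the bias terms and one for all other parameters. A minimal PyTorch sketch of that split is given below; the optimizer type (AdamW), the scheduler (LinearLR), and the name-based parameter grouping are assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical sketch: separate optimizers for bias terms (larger learning rate,
# e.g. 1e-3) and all other parameters (e.g. 2e-5), mirroring the setup quoted
# above. Optimizer, scheduler, and grouping heuristic are assumptions.
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR

def build_optimizers(model, lr_bias=1e-3, lr_other=2e-5, total_iters=1000):
    bias_params = [p for n, p in model.named_parameters() if n.endswith("bias")]
    other_params = [p for n, p in model.named_parameters() if not n.endswith("bias")]
    opt_bias = AdamW(bias_params, lr=lr_bias)
    opt_other = AdamW(other_params, lr=lr_other)
    # Each optimizer gets its own learning-rate schedule, decayed independently.
    sched_bias = LinearLR(opt_bias, start_factor=1.0, end_factor=0.0, total_iters=total_iters)
    sched_other = LinearLR(opt_other, start_factor=1.0, end_factor=0.0, total_iters=total_iters)
    return (opt_bias, sched_bias), (opt_other, sched_other)

# Usage inside a training step (loss computed elsewhere):
#   loss.backward()
#   for opt, sched in [(opt_bias, sched_bias), (opt_other, sched_other)]:
#       opt.step(); sched.step(); opt.zero_grad()
```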