Learn-to-Share: A Hardware-friendly Transfer Learning Framework Exploiting Computation and Parameter Sharing

Authors: Cheng Fu, Hanxian Huang, Xinyun Chen, Yuandong Tian, Jishen Zhao

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that with 1.4% of extra parameters per task, LeTS reduces the computation by 49.5% on GLUE benchmarks with only 0.2% accuracy loss compared to full fine-tuning.
Researcher Affiliation | Collaboration | 1 University of California, San Diego; 2 University of California, Berkeley; 3 Facebook AI Research.
Pseudocode | Yes | The detailed design flow is shown in Algorithm 1.
Open Source Code | No | The paper states 'Our pre-trained models and code base are from (Wolf et al., 2020)', which refers to a third-party library, but it does not state that the authors' own code for LeTS is open-sourced, nor does it provide a link.
Open Datasets | Yes | We evaluate LeTS on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), which consists of the following nine tasks: The Corpus of Linguistic Acceptability (CoLA). The Stanford Sentiment Treebank (SST-2). The Microsoft Research Paraphrase Corpus (MRPC). The Quora Question Pairs (QQP). The Semantic Textual Similarity Benchmark (STS-B). The Multi-Genre Natural Language Inference Corpus (MNLI; we test on both the matched domain MNLI-m and the mismatched domain MNLI-mm). The Stanford Question Answering Dataset (QNLI). The Recognizing Textual Entailment (RTE).
Dataset Splits | Yes | Table 2: Sensitivity study to sparsity ratio constraint and comparison to parameter-sharing baselines on the GLUE dev dataset.
Hardware Specification | Yes | The DNAS method takes 1 day on 4 V100 GPUs per task on average, which is less than 0.5% of the pre-training cost of BERT-Large (Devlin et al., 2019).
Software Dependencies | No | The paper mentions 'Pytorch Sparse', 'Tensorflow-Sparse', and 'Wolf et al., 2020' (HuggingFace Transformers) as software components, but does not provide specific version numbers for any of them.
Experiment Setup | Yes | We use N_steps = 100 to initialize W_δ (Sec. 3.1). Inspired by (Ravfogel & Goldberg), who observe that the bias terms require a larger learning rate to achieve better fine-tuning results, we apply two optimizers with different learning rate schedulers to update the bias terms (lr_b ∈ {1e-3, 5e-4}) and the other parts (lr_w ∈ {2e-5, 1e-5}) separately during the final fine-tuning. Details of other hyperparameters are shown in Appendix A.
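
The "Open Source Code" and "Open Datasets" rows above note that the paper builds on the HuggingFace code base (Wolf et al., 2020) and evaluates on GLUE. Below is a minimal sketch of how such a setup could be reproduced with the HuggingFace transformers and datasets libraries; the checkpoint name, task choice, and preprocessing are illustrative assumptions, since the paper does not release its own code or pin library versions.

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative task and backbone; any GLUE task key works here:
# cola, sst2, mrpc, qqp, stsb, mnli, qnli, rte
task = "cola"
dataset = load_dataset("glue", task)  # provides train/validation/test splits
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2
)

# The paper reports GLUE results on the dev (validation) split.
encoded_dev = dataset["validation"].map(
    lambda ex: tokenizer(ex["sentence"], truncation=True, padding="max_length"),
    batched=True,
)
print(encoded_dev)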
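The "Experiment Setup" row describes two optimizers with separate learning-rate schedulers: a larger lr_b for bias terms and a smaller lr_w for the remaining weights. The PyTorch sketch below illustrates one way to realize that split; the optimizer type, scheduler choice (linear warmup/decay), and step counts are assumptions, as the paper defers the remaining hyperparameters to its Appendix A.

import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizers(model, lr_b=1e-3, lr_w=2e-5, num_training_steps=10_000):
    # Split parameters into bias terms and everything else.
    bias_params = [p for n, p in model.named_parameters() if n.endswith("bias")]
    other_params = [p for n, p in model.named_parameters() if not n.endswith("bias")]

    # Two optimizers, each with its own learning rate and scheduler.
    opt_bias = torch.optim.AdamW(bias_params, lr=lr_b)
    opt_other = torch.optim.AdamW(other_params, lr=lr_w)
    sched_bias = get_linear_schedule_with_warmup(opt_bias, 0, num_training_steps)
    sched_other = get_linear_schedule_with_warmup(opt_other, 0, num_training_steps)
    return (opt_bias, sched_bias), (opt_other, sched_other)

# In the training loop, both optimizers and schedulers are stepped after each
# backward pass:
#   loss.backward()
#   opt_bias.step(); opt_other.step()
#   sched_bias.step(); sched_other.step()
#   opt_bias.zero_grad(); opt_other.zero_grad()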