Learning to Augment for Data-scarce Domain BERT Knowledge Distillation

Authors: Lingyun Feng, Minghui Qiu, Yaliang Li, Hai-Tao Zheng, Ying Shen (pp. 7422-7430)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art baselines on four different tasks, and for the data-scarce domains, the compressed student models even perform better than the original large teacher model with far fewer parameters (only 13.3%) when only a few labeled examples are available. We conduct experiments on four NLP tasks to examine the efficiency and effectiveness of the proposed method.
Researcher Affiliation | Collaboration | Lingyun Feng (1), Minghui Qiu (2), Yaliang Li (2), Hai-Tao Zheng (1), Ying Shen (3); affiliations: 1 Tsinghua University, 2 Alibaba Group, 3 Sun Yat-sen University
Pseudocode | Yes | Algorithm 1: Learning to Augment for Data-Scarce Domain BERT Compression
Open Source Code | No | The paper references third-party open-source code for baselines and initializations (e.g., 'https://github.com/microsoft/unilm/tree/master/minilm' and 'https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT' (2nd version)), but it does not provide a link or an explicit statement that the authors' own code for the described methodology is publicly available.
Open Datasets | Yes | We use MultiNLI (Williams, Nangia, and Bowman 2018) as the source domain and SciTail (Khot, Sabharwal, and Clark 2018) as the target. We treat the Quora question pairs (footnote 2) as the source domain and a paraphrase dataset made available in CIKM AnalytiCup 2018 (footnote 3) as the target. We treat SST-2 (Socher et al. 2013) as the source domain and RT (Pang and Lee 2005) as the target. We use the Electronics domain in the Amazon review dataset (McAuley and Leskovec 2013) as source data and the Watches domain as the target. Footnotes 2 and 3 provide links: 'www.kaggle.com/c/quora-question-pairs' and 'https://tianchi.aliyun.com/competition/introduction.htm?raceId=231661'.
Dataset Splits | No | The paper specifies how the training sets are subsampled: 'To mimic data-scarce domains, we subsample a small training set from the target domain for NLI and text classification tasks by randomly picking 40 instances for each class, and take 1% of the original data as our training data for review helpfulness prediction task.' While 'validation data Dv' is mentioned in Algorithm 1, no specific details (size, percentage, or method of creation) are provided for the validation split itself. A rough sketch of this per-class subsampling is given after the table.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to conduct the experiments.
Software Dependencies | No | The paper states: 'All models are implemented with PyTorch (Paszke et al. 2019) and Python 3.6.' While Python has a version number, PyTorch is mentioned only by citation without a specific version number, which is insufficient for full reproducibility according to the criteria.
Experiment Setup | Yes | We set the maximum sequence length to 128. We tune the temperature α from {0.6, 0.7, 0.8, 0.9, 1.0} and choose α = 0.6 for the best performance. We tune T from {1, 2, 4, 8} and choose T = 1 for the best performance. The sample size is 20 for the target domain and 1 for the source domain. The batch size is chosen from {8, 16, 32} and the learning rate is tuned from {2e-5, 3e-5, 5e-5}. For the reinforced selector, we use the Adam optimizer (Kingma and Ba 2015) with the setting β1 = 0.9, β2 = 0.998. The size of the hidden layer of the policy network is 128. The learning rate is set to 3e-5. These values are collected in the configuration sketch after the table.
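
The following is a minimal sketch of the per-class subsampling described in the Dataset Splits row (randomly picking 40 instances per class to mimic a data-scarce target domain). The function name, argument names, and seed are illustrative assumptions; this is not the authors' code, which is not released.

```python
import random
from collections import defaultdict

def subsample_per_class(examples, labels, per_class=40, seed=0):
    """Randomly pick `per_class` instances for each label (names and seed are illustrative)."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for x, y in zip(examples, labels):
        buckets[y].append(x)
    subset = []
    for y, xs in buckets.items():
        rng.shuffle(xs)
        subset.extend((x, y) for x in xs[:per_class])
    rng.shuffle(subset)  # mix classes before training
    return subset
```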
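
For readability, the hyperparameters reported in the Experiment Setup row can be gathered into a single configuration. This is an illustrative sketch only: the paper does not release a configuration file, and the key names below are assumptions; the values are those quoted above.

```python
# Hypothetical consolidation of the hyperparameters reported in the paper;
# key names are illustrative, values are those quoted in the table above.
HPARAMS = {
    "max_seq_length": 128,
    "temperature_alpha": 0.6,          # tuned over {0.6, 0.7, 0.8, 0.9, 1.0}
    "T": 1,                            # tuned over {1, 2, 4, 8}
    "sample_size_target": 20,
    "sample_size_source": 1,
    "batch_size_grid": [8, 16, 32],    # batch size chosen from this grid
    "learning_rate_grid": [2e-5, 3e-5, 5e-5],
    # Reinforced selector (policy network)
    "selector_optimizer": "Adam",      # Kingma and Ba 2015
    "selector_adam_betas": (0.9, 0.998),
    "selector_hidden_size": 128,
    "selector_learning_rate": 3e-5,
}
```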