LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding

Authors: Hao Fu, Shaojun Zhou, Qihong Yang, Junjie Tang, Guiquan Liu, Kaikui Liu, Xiaolong Li

AAAI 2021, pp. 12830-12838 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, by verifying 9 datasets on the General Language Understanding Evaluation (GLUE) benchmark, the performance of the proposed LRC-BERT exceeds the existing state-of-the-art methods, which proves the effectiveness of our method.
Researcher Affiliation | Collaboration | 1) School of Computer Science and Technology, University of Science and Technology of China; 2) Alibaba Group
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology.
Open Datasets | Yes | We evaluate LRC-BERT on the GLUE benchmark. The datasets provided on GLUE were all from NLP datasets with high recognition. We evaluate LRC-BERT in tasks such as natural language reasoning, emotion analysis, reading comprehension and semantic similarity.
Dataset Splits | No | The paper refers to using 'dev' sets for evaluation (e.g., 'The evaluation results of these four tasks on dev are shown in Table 3.') and reports training sample counts in Table 1, but it does not give explicit percentages or absolute counts for training, validation, and test splits, nor the methodology used to produce them (e.g., an '80/10/10' split).
Hardware Specification | Yes | We distill our student model with 6 V100 in the pretraining stage, and 4 V100 for distillation training on specific task dataset and extended dataset. In the inference experiments, we report the results of the student on a single V100.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies (e.g., programming languages, libraries, or frameworks).
Experiment Setup | Yes | For the distillation of each task on GLUE, we fine-tune a BERT-base teacher, choosing learning rates of 5e-5, 1e-4, and 3e-4 with a batch size of 16 to distill LRC-BERT and LRC-BERT1. For each sample, we choose the remaining 15 samples in the batch as negative samples, i.e. K = 15. Among them, 90 epochs of distillation are performed on MRPC, RTE, and CoLA, whose training sets are smaller than 10K, and 18 epochs of distillation on the other tasks. For the proposed two-stage training method, the first 80% of the steps are chosen as the first stage of training, and the remaining 20% of the steps are the second stage. We set the parameters of the second stage to α : β : γ = 1 : 1 : 3, with a search range of {1, 2, 3, 4} for each parameter. For the hyperparameter temperature τ, we set it to 1.1.
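The reported hyperparameters fit together in a fairly compact way. Below is a minimal PyTorch sketch (not the authors' code, which is not released) of how they could be wired up: in-batch negatives with K = 15 from a batch of 16, temperature τ = 1.1, and the 80%/20% two-stage schedule with second-stage weights α : β : γ = 1 : 1 : 3. The contrastive term is written as a generic InfoNCE-style cosine objective rather than the paper's exact formulation, the stage-one weighting is an assumption (the excerpt only specifies the second-stage ratio), and all function and loss-term names are illustrative.

```python
# Minimal sketch of the reported setup; not the authors' implementation.
# Assumptions are noted inline.
import torch
import torch.nn.functional as F

TAU = 1.1          # temperature reported for the contrastive objective
BATCH_SIZE = 16    # so each sample has K = 15 in-batch negatives

def contrastive_distill_loss(student_repr, teacher_repr, tau=TAU):
    """Contrast each student representation against its own teacher
    representation (positive) and the remaining K teacher representations
    in the batch (negatives), using temperature-scaled cosine similarity.
    This is a generic InfoNCE-style stand-in, not the paper's exact loss."""
    s = F.normalize(student_repr, dim=-1)               # (B, d)
    t = F.normalize(teacher_repr, dim=-1)                # (B, d)
    logits = s @ t.T / tau                               # (B, B)
    targets = torch.arange(s.size(0), device=s.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)

def stage_weights(step, total_steps, alpha=1.0, beta=1.0, gamma=3.0):
    """Two-stage schedule: the first 80% of steps are stage one, the last
    20% are stage two with alpha : beta : gamma = 1 : 1 : 3. The stage-one
    weights used here (contrastive term only) are an assumption."""
    if step < 0.8 * total_steps:
        return 1.0, 0.0, 0.0
    return alpha, beta, gamma

def total_loss(step, total_steps, contrastive, distance, prediction):
    """Combine three loss terms (hypothetical names) with stage-dependent weights."""
    a, b, g = stage_weights(step, total_steps)
    return a * contrastive + b * distance + g * prediction
```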