LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding
Authors: Hao Fu, Shaojun Zhou, Qihong Yang, Junjie Tang, Guiquan Liu, Kaikui Liu, Xiaolong Li
AAAI 2021, pp. 12830-12838 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, by verifying 9 datasets on the General Language Understanding Evaluation (GLUE) benchmark, the performance of the proposed LRC-BERT exceeds the existing state-of-the-art methods, which proves the effectiveness of our method. |
| Researcher Affiliation | Collaboration | School of Computer Science and Technology, University of Science and Technology of China; Alibaba Group |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its open-source code for the described methodology. |
| Open Datasets | Yes | We evaluate LRC-BERT on GLUE benchmark. The datasets provided on GLUE were all from NLP datasets with high recognition. We evaluate LRC-BERT in tasks such as natural language reasoning, emotion analysis, reading comprehension and semantic similarity. |
| Dataset Splits | No | The paper refers to using 'dev' sets for evaluation (e.g., 'The evaluation results of these four tasks on dev are shown in Table 3.'), and provides training sample counts for datasets in Table 1, but it does not specify explicit percentages or absolute counts for training, validation, and test splits or the methodology for these splits (e.g., '80/10/10 split'). |
| Hardware Specification | Yes | We distill our student model with 6 V100 in the pretraining stage, and 4 V100 for distillation training on specific task dataset and extended dataset. In the inference experiments, we report the results of the student on a single V100. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | For the distillation of each task on GLUE, we fine-tune a BERT-base teacher, choosing learning rates of 5e-5, 1e-4, and 3e-4 with a batch size of 16 to distill LRC-BERT and LRC-BERT1. For each sample, we choose the remaining 15 samples in the batch as negative samples, i.e. K = 15. Among them, 90 epochs of distillation are performed on MRPC, RTE, and CoLA, whose training sets contain fewer than 10K samples, and 18 epochs of distillation on the other tasks. For the proposed two-stage training method, the first 80% of the steps are chosen as the first stage of training and the remaining 20% of the steps form the second stage. We set the parameters of the second stage to α : β : γ = 1 : 1 : 3, with a search range of {1, 2, 3, 4} for each parameter. For the temperature hyperparameter τ, we set it to 1.1. (Hedged sketches of these settings appear after the table.) |
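
The Open Datasets row above notes that evaluation uses the public GLUE benchmark, but the paper does not say how the data was obtained. Below is a minimal loading sketch, assuming the Hugging Face `datasets` library (a tool not mentioned in the paper), using the GLUE task names that library exposes:

```python
# Hypothetical loading sketch: the paper does not specify data tooling.
# Assumes the Hugging Face `datasets` package is installed.
from datasets import load_dataset

# GLUE task names as exposed by the `datasets` library; the paper evaluates
# tasks such as MRPC, RTE, CoLA, and the larger GLUE sets.
GLUE_TASKS = ["cola", "sst2", "mrpc", "qqp", "stsb", "mnli", "qnli", "rte", "wnli"]

for task in GLUE_TASKS:
    ds = load_dataset("glue", task)
    # The report notes that only "dev" (validation) results are discussed;
    # GLUE test labels are withheld by the benchmark server.
    print(task, {split: len(ds[split]) for split in ds})
```

Printing the split names rather than assuming them matters because some tasks (e.g. MNLI) expose matched and mismatched validation sets.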
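
The Experiment Setup row reports a batch size of 16 with K = 15 in-batch negatives, a temperature τ of 1.1, and second-stage loss weights α : β : γ = 1 : 1 : 3. Since no official code is linked, the sketch below is only an illustrative PyTorch reading of those numbers: an InfoNCE-style contrastive term over intermediate representations with in-batch negatives, combined with soft-label and hard-label terms. The exact angle-based similarity used by LRC-BERT and the assignment of α, β, γ to specific terms are assumptions here; every function is a hypothetical stand-in, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

TAU = 1.1                            # temperature reported in the paper
ALPHA, BETA, GAMMA = 1.0, 1.0, 3.0   # second-stage weights alpha:beta:gamma = 1:1:3


def contrastive_distill_loss(student_h, teacher_h, tau=TAU):
    """In-batch contrastive loss between student and teacher representations.

    student_h, teacher_h: (B, d) intermediate-layer outputs for the same B
    inputs; with B = 16 each sample has K = 15 in-batch negatives. This is an
    illustrative InfoNCE-style formulation, not necessarily the exact
    angle-based similarity used in LRC-BERT.
    """
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)
    logits = s @ t.T / tau                          # (B, B): diagonal entries are positives
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)


def second_stage_loss(student_h, teacher_h, student_logits, teacher_logits, labels):
    """Weighted combination for the second training stage (which weight applies
    to which term is an assumption made for this sketch)."""
    l_contrast = contrastive_distill_loss(student_h, teacher_h)
    l_soft = F.kl_div(F.log_softmax(student_logits, dim=-1),
                      F.softmax(teacher_logits, dim=-1),
                      reduction="batchmean")
    l_hard = F.cross_entropy(student_logits, labels)
    return ALPHA * l_contrast + BETA * l_soft + GAMMA * l_hard


# Toy shapes only: batch of 16 (so K = 15 negatives), hidden size 768, 2 classes.
if __name__ == "__main__":
    B, d, C = 16, 768, 2
    loss = second_stage_loss(torch.randn(B, d), torch.randn(B, d),
                             torch.randn(B, C), torch.randn(B, C),
                             torch.randint(0, C, (B,)))
    print(loss.item())
```

The 80%/20% step split described in the row would then amount to optimizing only the contrastive term for the first 80% of steps and switching to `second_stage_loss` for the remaining 20%.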