Adversarial Data Augmentation for Task-Specific Knowledge Distillation of Pre-trained Transformers
Authors: Minjia Zhang, Niranjan Uma Naresh, Yuxiong He (pp. 11685-11693)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that this method allows better transfer of knowledge from the teacher to the student during distillation, producing student models that retain 99.6% accuracy of the teacher model while outperforming existing task-specific knowledge distillation baselines by 1.2 points on average over a variety of natural language understanding tasks. |
| Researcher Affiliation | Industry | Microsoft Corporation, Bellevue, Washington 98004 {minjiaz,Niranjan.Uma,yuxhe}@microsoft.com |
| Pseudocode | Yes | The full procedure of AD2 is provided in Algorithm 1. |
| Open Source Code | No | The paper does not provide concrete access to source code developed for the described methodology. |
| Open Datasets | Yes | Following previous work on distilling pre-trained language model (Sanh et al. 2019; Sun et al. 2019; Dong et al. 2019), we evaluate the effectiveness of AD2 using the GLUE (General Language Understanding Evaluation) benchmark (Wang et al. 2019), a collection of linguistic tasks in different domains such as textual entailment, sentiment analysis, and question answering. |
| Dataset Splits | Yes | The model with the best validation accuracy is selected for each task, and we report the median of 5 runs with different random seeds for each selected configuration. |
| Hardware Specification | No | The paper mentions 'one NVIDIA V100 GPU' only in the context of comparing training time with TinyBERT, and does not explicitly state that all or any specific part of its own experiments were run on this hardware. It also acknowledges 'infrastructure supported by the DeepSpeed team and Azure Integrated Training Platform (ITP)', which names shared infrastructure without giving specific hardware details. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., Python, PyTorch, or other libraries with their versions). |
| Experiment Setup | Yes | Hyperparameters. In order to reduce the hyperparameter search space, we fix the number of epochs as 6 for all the experiments and tune the batch size from {16, 32} and learning rate from {1e-5, 3e-5, 5e-5, 7e-5, 9e-5, 1e-4} for all configurations on each task. The maximum sequence length is set to 512. We use a linear learning rate decay schedule with a warm-up ratio of 0.1 for all experiments. We clip the gradient norm within 1. For AD2, we set the perturbation radius ϵ = 1e-5, PGA step size 1e-3, temperature t=1, and α = 1. |
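To make the reported search space concrete, the following is a minimal sketch of the per-task hyperparameter grid described in the Experiment Setup row. The `TrainConfig` dataclass and `make_configs` helper are illustrative names, not taken from any released code for the paper.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class TrainConfig:
    epochs: int = 6               # fixed to 6 for all experiments
    batch_size: int = 16          # tuned from {16, 32}
    learning_rate: float = 1e-5   # tuned from {1e-5, 3e-5, 5e-5, 7e-5, 9e-5, 1e-4}
    max_seq_length: int = 512
    warmup_ratio: float = 0.1     # linear decay schedule with 0.1 warm-up
    max_grad_norm: float = 1.0    # gradient norm clipped within 1
    epsilon: float = 1e-5         # AD2 perturbation radius
    pga_step_size: float = 1e-3   # projected gradient ascent step size
    temperature: float = 1.0      # distillation temperature t
    alpha: float = 1.0            # loss mixing weight

def make_configs():
    """Enumerate the per-task grid: batch size x learning rate (12 configurations)."""
    for bs, lr in product([16, 32], [1e-5, 3e-5, 5e-5, 7e-5, 9e-5, 1e-4]):
        yield TrainConfig(batch_size=bs, learning_rate=lr)
```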
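The AD2-specific settings above (perturbation radius ϵ, PGA step size, temperature t, and α) hint at how they enter the distillation objective. The snippet below is a hedged sketch of a single projected-gradient-ascent perturbation on the student's input embeddings combined with a temperature-scaled KL distillation loss; the function names (`kd_loss`, `pga_perturb`) and the projection details are assumptions for illustration, not a reproduction of the paper's Algorithm 1.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, t=1.0):
    """Temperature-scaled KL divergence between teacher and student predictions."""
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def pga_perturb(embeddings, loss_fn, epsilon=1e-5, step_size=1e-3):
    """One projected-gradient-ascent step on input embeddings (assumed form).

    loss_fn maps perturbed embeddings to a scalar loss; the perturbation is moved
    in the gradient direction and then clipped back into an epsilon-ball.
    """
    delta = torch.zeros_like(embeddings, requires_grad=True)
    grad, = torch.autograd.grad(loss_fn(embeddings + delta), delta)
    delta = step_size * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    return embeddings + delta.clamp(-epsilon, epsilon).detach()

# Usage sketch (hypothetical): augment embeddings, then mix task loss and KD loss.
# adv_embeds = pga_perturb(embeds, lambda e: kd_loss(student(inputs_embeds=e), teacher_logits))
# total_loss = task_loss + alpha * kd_loss(student(inputs_embeds=adv_embeds), teacher_logits, t=1.0)
```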