Adversarial Data Augmentation for Task-Specific Knowledge Distillation of Pre-trained Transformers

Authors: Minjia Zhang, Niranjan Uma Naresh, Yuxiong He

AAAI 2022, pp. 11685-11693

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that this method allows better transfer of knowledge from the teacher to the student during distillation, producing student models that retain 99.6% accuracy of the teacher model while outperforming existing task-specific knowledge distillation baselines by 1.2 points on average over a variety of natural language understanding tasks.
Researcher Affiliation | Industry | Microsoft Corporation, Bellevue, Washington 98004 {minjiaz,Niranjan.Uma,yuxhe}@microsoft.com
Pseudocode | Yes | The full procedure of AD2 is provided in Algorithm 1.
Open Source Code | No | The paper does not provide access to source code for the described methodology.
Open Datasets | Yes | Following previous work on distilling pre-trained language models (Sanh et al. 2019; Sun et al. 2019; Dong et al. 2019), we evaluate the effectiveness of AD2 using the GLUE (General Language Understanding Evaluation) benchmark (Wang et al. 2019), a collection of linguistic tasks in different domains such as textual entailment, sentiment analysis, and question answering.
Dataset Splits | Yes | The model with the best validation accuracy is selected for each task, and we report the median of 5 runs with different random seeds for each selected configuration.
Hardware Specification | No | The paper mentions 'one NVIDIA V100 GPU' when comparing training time with TinyBERT, but does not explicitly state that its own experiments were run on this hardware. It also mentions 'infrastructure supported by the DeepSpeed team and Azure Integrated Training Platform (ITP)', which points to a general cloud environment without specific hardware details.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., Python, PyTorch, or other libraries with their versions).
Experiment Setup | Yes | Hyperparameters. In order to reduce the hyperparameter search space, we fix the number of epochs as 6 for all the experiments and tune the batch size from {16, 32} and learning rate from {1e-5, 3e-5, 5e-5, 7e-5, 9e-5, 1e-4} for all configurations on each task. The maximum sequence length is set to 512. We use a linear learning rate decay schedule with a warm-up ratio of 0.1 for all experiments. We clip the gradient norm within 1. For AD2, we set the perturbation radius ϵ = 1e-5, PGA step size 1e-3, temperature t=1, and α = 1.
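
To show how the quoted tuning protocol fits together, here is a minimal Python sketch. It assumes a hypothetical `train_and_eval(task, cfg)` helper that fine-tunes the student with a given configuration and returns validation accuracy; the grid, the fixed settings, the best-validation selection, and the median over 5 seeds come from the Dataset Splits and Experiment Setup rows above, while the function and variable names and the seed values are illustrative.

```python
# Sketch of the tuning protocol quoted above: grid-search batch size and
# learning rate with other settings fixed, select the configuration with the
# best validation accuracy, then report the median over 5 random seeds.
# train_and_eval() is a hypothetical placeholder, not the authors' code.
from itertools import product
from statistics import median

GRID = {
    "batch_size": [16, 32],
    "learning_rate": [1e-5, 3e-5, 5e-5, 7e-5, 9e-5, 1e-4],
}
FIXED = {
    "num_epochs": 6,          # fixed at 6 for all experiments
    "max_seq_length": 512,
    "lr_schedule": "linear",  # linear decay with warm-up
    "warmup_ratio": 0.1,
    "max_grad_norm": 1.0,     # gradient norm clipped to 1
}


def tune(task, train_and_eval):
    """Return (median accuracy over 5 seeds, best configuration) for a task."""
    best_acc, best_cfg = float("-inf"), None
    for bs, lr in product(GRID["batch_size"], GRID["learning_rate"]):
        cfg = dict(FIXED, batch_size=bs, learning_rate=lr, seed=0)
        acc = train_and_eval(task, cfg)  # returns validation accuracy
        if acc > best_acc:
            best_acc, best_cfg = acc, cfg

    # Report the median of 5 runs with different random seeds
    # for the selected configuration.
    accs = [train_and_eval(task, dict(best_cfg, seed=s)) for s in range(5)]
    return median(accs), best_cfg
```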
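
The AD2-specific hyperparameters in the same row (perturbation radius ϵ = 1e-5, PGA step size 1e-3, temperature t = 1, α = 1) would enter an adversarial distillation step roughly as in the PyTorch sketch below. This is a simplified illustration of the general idea, not the paper's Algorithm 1: the Hugging Face-style `inputs_embeds`/`.logits` model interface, the single sign-gradient ascent step, the L-infinity projection, and the helper names are all assumptions.

```python
# Simplified sketch of one adversarial-distillation training step in the
# spirit of AD2, using the hyperparameters quoted above. Not Algorithm 1
# from the paper: the model interface, the sign-gradient PGA update, and
# the L-infinity projection are illustrative assumptions.
import torch
import torch.nn.functional as F

EPS = 1e-5         # perturbation radius (assumed L-infinity ball)
PGA_STEP = 1e-3    # projected gradient ascent step size
TEMPERATURE = 1.0  # distillation temperature t
ALPHA = 1.0        # weight of the distillation term


def kd_loss(student_logits, teacher_logits, t=TEMPERATURE):
    """Soft-label distillation loss: KL divergence at temperature t."""
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


def ad2_style_step(student, teacher, embeds, attention_mask, labels, pga_steps=1):
    """Perturb the input embeddings with projected gradient ascent, then
    distill the teacher into the student on the perturbed inputs."""
    with torch.no_grad():
        teacher_logits = teacher(inputs_embeds=embeds, attention_mask=attention_mask).logits

    # Projected gradient ascent: grow a small perturbation delta within the
    # epsilon-ball that maximizes the student/teacher discrepancy.
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(pga_steps):
        adv_logits = student(inputs_embeds=embeds + delta, attention_mask=attention_mask).logits
        grad, = torch.autograd.grad(kd_loss(adv_logits, teacher_logits), delta)
        delta = (delta + PGA_STEP * grad.sign()).clamp(-EPS, EPS)
        delta = delta.detach().requires_grad_(True)

    # Train the student on the perturbed inputs:
    # task cross-entropy + alpha * distillation loss.
    student_logits = student(inputs_embeds=embeds + delta.detach(), attention_mask=attention_mask).logits
    return F.cross_entropy(student_logits, labels) + ALPHA * kd_loss(student_logits, teacher_logits)
```

The exact procedure, including how the perturbation and the distillation objective are combined, is given in Algorithm 1 of the paper; this sketch only approximates it.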