Adversarial Data Augmentation for Task-Specific Knowledge Distillation of Pre-trained Transformers
Authors: Minjia Zhang, Niranjan Uma Naresh, Yuxiong He (pp. 11685-11693)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that this method allows better transfer of knowledge from the teacher to the student during distillation, producing student models that retain 99.6% accuracy of the teacher model while outperforming existing task-specific knowledge distillation baselines by 1.2 points on average over a variety of natural language understanding tasks. |
| Researcher Affiliation | Industry | Microsoft Corporation, Bellevue, Washington 98004 {minjiaz,Niranjan.Uma,yuxhe}@microsoft.com |
| Pseudocode | Yes | The full procedure of AD2 is provided in Algorithm 1. |
| Open Source Code | No | The paper does not provide concrete access to source code developed for the described methodology. |
| Open Datasets | Yes | Following previous work on distilling pre-trained language model (Sanh et al. 2019; Sun et al. 2019; Dong et al. 2019), we evaluate the effectiveness of AD2 using the GLUE (General Language Understanding Evaluation) benchmark (Wang et al. 2019), a collection of linguistic tasks in different domains such as textual entailment, sentiment analysis, and question answering. |
| Dataset Splits | Yes | The model with the best validation accuracy is selected for each task, and we report the median of 5 runs with different random seeds for each selected configuration. |
| Hardware Specification | No | The paper mentions 'one NVIDIA V100 GPU' only in the context of comparing training time with TinyBERT, and does not explicitly state that all or any specific part of its own experiments were run on this hardware. It also acknowledges 'infrastructure supported by the DeepSpeed team and Azure Integrated Training Platform (ITP)', which names shared infrastructure without giving specific hardware details. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., Python, PyTorch, or other libraries with their versions). |
| Experiment Setup | Yes | Hyperparameters. In order to reduce the hyperparameter search space, we fix the number of epochs as 6 for all the experiments and tune the batch size from {16, 32} and learning rate from {1e-5, 3e-5, 5e-5, 7e-5, 9e-5, 1e-4} for all configurations on each task. The maximum sequence length is set to 512. We use a linear learning rate decay schedule with a warm-up ratio of 0.1 for all experiments. We clip the gradient norm within 1. For AD2, we set the perturbation radius ϵ = 1e-5, PGA step size 1e-3, temperature t=1, and α = 1. |
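To make the reported search space concrete, the following is a minimal sketch of the per-task hyperparameter grid described in the Experiment Setup row. The `TrainConfig` dataclass and `make_configs` helper are illustrative names, not taken from any released code for the paper.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class TrainConfig:
    epochs: int = 6               # fixed to 6 for all experiments
    batch_size: int = 16          # tuned from {16, 32}
    learning_rate: float = 1e-5   # tuned from {1e-5, 3e-5, 5e-5, 7e-5, 9e-5, 1e-4}
    max_seq_length: int = 512
    warmup_ratio: float = 0.1     # linear decay schedule with 0.1 warm-up
    max_grad_norm: float = 1.0    # gradient norm clipped within 1
    epsilon: float = 1e-5         # AD2 perturbation radius
    pga_step_size: float = 1e-3   # projected gradient ascent step size
    temperature: float = 1.0      # distillation temperature t
    alpha: float = 1.0            # loss mixing weight

def make_configs():
    """Enumerate the per-task grid: batch size x learning rate (12 configurations)."""
    for bs, lr in product([16, 32], [1e-5, 3e-5, 5e-5, 7e-5, 9e-5, 1e-4]):
        yield TrainConfig(batch_size=bs, learning_rate=lr)
```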
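The AD2-specific settings above (perturbation radius ϵ, PGA step size, temperature t, and α) hint at how they enter the distillation objective. The snippet below is a hedged sketch of a single projected-gradient-ascent perturbation on the student's input embeddings combined with a temperature-scaled KL distillation loss; the function names (`kd_loss`, `pga_perturb`) and the projection details are assumptions for illustration, not a reproduction of the paper's Algorithm 1.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, t=1.0):
    """Temperature-scaled KL divergence between teacher and student predictions."""
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def pga_perturb(embeddings, loss_fn, epsilon=1e-5, step_size=1e-3):
    """One projected-gradient-ascent step on input embeddings (assumed form).

    loss_fn maps perturbed embeddings to a scalar loss; the perturbation is moved
    in the gradient direction and then clipped back into an epsilon-ball.
    """
    delta = torch.zeros_like(embeddings, requires_grad=True)
    grad, = torch.autograd.grad(loss_fn(embeddings + delta), delta)
    delta = step_size * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    return embeddings + delta.clamp(-epsilon, epsilon).detach()

# Usage sketch (hypothetical): augment embeddings, then mix task loss and KD loss.
# adv_embeds = pga_perturb(embeds, lambda e: kd_loss(student(inputs_embeds=e), teacher_logits))
# total_loss = task_loss + alpha * kd_loss(student(inputs_embeds=adv_embeds), teacher_logits, t=1.0)
```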