Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models

Authors: Dongkuan (DK) Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey, Wenhui Wang, Xiang Zhang, Ahmed Awadallah, Jianfeng Gao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on GLUE benchmark demonstrate AutoDistil to outperform state-of-the-art KD and NAS methods with up to 41x reduction in computational cost.
Researcher Affiliation | Collaboration | Dongkuan Xu (NC State University, dxu27@ncsu.edu); Subhabrata Mukherjee (Microsoft Research, submukhe@microsoft.com); Xiaodong Liu (Microsoft Research); Debadeepta Dey (Microsoft Research); Wenhui Wang (Microsoft Research); Xiang Zhang (Penn State University); Ahmed Hassan Awadallah (Microsoft Research); Jianfeng Gao (Microsoft Research)
Pseudocode | Yes | Algorithm 1: Few-shot Task-agnostic Knowledge Distillation with AutoDistil.
Open Source Code | Yes | Code and models are available at aka.ms/autodistil.
Open Datasets | Yes | We conduct experiments on General Language Understanding Evaluation (GLUE) benchmark [30]. We use English Wikipedia and BookCorpus data for SuperLM training with WordPiece tokenization.
Dataset Splits | Yes | We conduct experiments on General Language Understanding Evaluation (GLUE) benchmark [30]. We compare our method with the baseline methods on two single-sentence classification tasks (CoLA [31], SST-2 [32]), two similarity and paraphrase tasks (MRPC [33], QQP [34]), and three inference tasks (MNLI [29], QNLI [35], RTE [36, 37, 38, 39]). We compute the task-agnostic self-attention distillation loss for all student subnetworks using Eqn. (4) on a heldout validation set from the unlabeled training corpus.
Hardware Specification | Yes | We use 16 V100 GPUs to train the SuperLM with 336 GPU-hours as the training cost.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x) are explicitly listed.
Experiment Setup | Yes | We use a batch size of 128 and 4e-5 as the peak learning rate for 10 epochs. The maximum sequence length is set to 128. The coefficients in the distillation objective (Eqn. (4)), β1, β2, and β3, are all set to 1. We distill the self-attention knowledge of the last layer to train the SuperLM.
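
The Dataset Splits and Experiment Setup rows above both refer to the task-agnostic self-attention distillation objective of Eqn. (4) with coefficients β1, β2, and β3. The following is a minimal sketch of such an objective, assuming a MiniLM-style formulation that compares query, key, and value self-attention relations of the teacher's and student's last layers via KL divergence; the function names and tensor shapes are illustrative, not the authors' released implementation.

# Minimal sketch of a MiniLM-style self-attention relation distillation
# objective. ASSUMPTION: Eqn. (4) combines KL divergences over query, key,
# and value self-attention relations of the teacher's and student's last
# layers, weighted by beta1, beta2, beta3 (all set to 1 in the paper's setup).
import torch
import torch.nn.functional as F


def relation_kl(teacher_x: torch.Tensor, student_x: torch.Tensor) -> torch.Tensor:
    """KL divergence between scaled dot-product relation distributions.

    teacher_x, student_x: [batch, relation_heads, seq_len, head_dim]
    projections (queries, keys, or values) from the last Transformer layer;
    assumes teacher and student use the same number of relation heads.
    """
    rel_t = torch.softmax(
        teacher_x @ teacher_x.transpose(-1, -2) / teacher_x.size(-1) ** 0.5, dim=-1)
    rel_s = torch.log_softmax(
        student_x @ student_x.transpose(-1, -2) / student_x.size(-1) ** 0.5, dim=-1)
    return F.kl_div(rel_s, rel_t, reduction="batchmean")


def attention_distillation_loss(teacher_qkv, student_qkv,
                                beta1=1.0, beta2=1.0, beta3=1.0):
    """Weighted sum of query-, key-, and value-relation KL terms (cf. Eqn. (4))."""
    q_t, k_t, v_t = teacher_qkv
    q_s, k_s, v_s = student_qkv
    return (beta1 * relation_kl(q_t, q_s)
            + beta2 * relation_kl(k_t, k_s)
            + beta3 * relation_kl(v_t, v_s))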
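
The hyperparameters quoted in the Experiment Setup row can also be summarized as a single configuration object. This is only an illustrative grouping of the reported values; the class and field names are assumptions, not the authors' configuration format.

# Illustrative grouping of the hyperparameters quoted in the Experiment Setup
# row. ASSUMPTION: the dataclass and its field names are introduced here for
# clarity; they are not the authors' actual configuration schema.
from dataclasses import dataclass


@dataclass
class SuperLMDistillationConfig:
    batch_size: int = 128             # "a batch size of 128"
    peak_learning_rate: float = 4e-5  # "4e-5 as the peak learning rate"
    num_epochs: int = 10              # "for 10 epochs"
    max_seq_length: int = 128         # "maximum sequence length is set to 128"
    beta1: float = 1.0                # coefficients of the distillation
    beta2: float = 1.0                # objective in Eqn. (4), all set to 1
    beta3: float = 1.0
    distill_layer: str = "last"       # self-attention knowledge of the last layer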