Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models

Authors: Dongkuan (DK) Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey, Wenhui Wang, Xiang Zhang, Ahmed Awadallah, Jianfeng Gao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on GLUE benchmark demonstrate AutoDistil to outperform state-of-the-art KD and NAS methods with up to 41x reduction in computational cost.
Researcher Affiliation | Collaboration | Dongkuan Xu (NC State University, dxu27@ncsu.edu); Subhabrata Mukherjee (Microsoft Research, submukhe@microsoft.com); Xiaodong Liu (Microsoft Research); Debadeepta Dey (Microsoft Research); Wenhui Wang (Microsoft Research); Xiang Zhang (Penn State University); Ahmed Hassan Awadallah (Microsoft Research); Jianfeng Gao (Microsoft Research)
Pseudocode | Yes | Algorithm 1: Few-shot Task-agnostic Knowledge Distillation with AutoDistil.
Open Source Code | Yes | Code and models are available at aka.ms/autodistil.
Open Datasets | Yes | We conduct experiments on General Language Understanding Evaluation (GLUE) benchmark [30]. We use English Wikipedia and BookCorpus data for SuperLM training with WordPiece tokenization.
Dataset Splits | Yes | We conduct experiments on General Language Understanding Evaluation (GLUE) benchmark [30]. We compare our method with the baseline methods on two single-sentence classification tasks (CoLA [31], SST-2 [32]), two similarity and paraphrase tasks (MRPC [33], QQP [34]), and three inference tasks (MNLI [29], QNLI [35], RTE [36, 37, 38, 39]). We compute the task-agnostic self-attention distillation loss for all student subnetworks using Eqn. (4) on a heldout validation set from the unlabeled training corpus.
Hardware Specification | Yes | We use 16 V100 GPUs to train the SuperLM with 336 GPU-hours as the training cost.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x) are explicitly listed.
Experiment Setup | Yes | We use a batch size of 128 and 4e-5 as the peak learning rate for 10 epochs. The maximum sequence length is set to 128. The coefficients in the distillation objective (Eqn. (4)), β1, β2, and β3, are all set to 1. We distill the self-attention knowledge of the last layer to train the SuperLM.
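
The Dataset Splits and Experiment Setup rows above both refer to the task-agnostic self-attention distillation objective of Eqn. (4) with coefficients β1, β2, and β3. The following is a minimal sketch of such an objective, assuming a MiniLM-style formulation that compares query, key, and value self-attention relations of the teacher's and student's last layers via KL divergence; the function names and tensor shapes are illustrative, not the authors' released implementation.

# Minimal sketch of a MiniLM-style self-attention relation distillation
# objective. ASSUMPTION: Eqn. (4) combines KL divergences over query, key,
# and value self-attention relations of the teacher's and student's last
# layers, weighted by beta1, beta2, beta3 (all set to 1 in the paper's setup).
import torch
import torch.nn.functional as F


def relation_kl(teacher_x: torch.Tensor, student_x: torch.Tensor) -> torch.Tensor:
    """KL divergence between scaled dot-product relation distributions.

    teacher_x, student_x: [batch, relation_heads, seq_len, head_dim]
    projections (queries, keys, or values) from the last Transformer layer;
    assumes teacher and student use the same number of relation heads.
    """
    rel_t = torch.softmax(
        teacher_x @ teacher_x.transpose(-1, -2) / teacher_x.size(-1) ** 0.5, dim=-1)
    rel_s = torch.log_softmax(
        student_x @ student_x.transpose(-1, -2) / student_x.size(-1) ** 0.5, dim=-1)
    return F.kl_div(rel_s, rel_t, reduction="batchmean")


def attention_distillation_loss(teacher_qkv, student_qkv,
                                beta1=1.0, beta2=1.0, beta3=1.0):
    """Weighted sum of query-, key-, and value-relation KL terms (cf. Eqn. (4))."""
    q_t, k_t, v_t = teacher_qkv
    q_s, k_s, v_s = student_qkv
    return (beta1 * relation_kl(q_t, q_s)
            + beta2 * relation_kl(k_t, k_s)
            + beta3 * relation_kl(v_t, v_s))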
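
The hyperparameters quoted in the Experiment Setup row can also be summarized as a single configuration object. This is only an illustrative grouping of the reported values; the class and field names are assumptions, not the authors' configuration format.

# Illustrative grouping of the hyperparameters quoted in the Experiment Setup
# row. ASSUMPTION: the dataclass and its field names are introduced here for
# clarity; they are not the authors' actual configuration schema.
from dataclasses import dataclass


@dataclass
class SuperLMDistillationConfig:
    batch_size: int = 128             # "a batch size of 128"
    peak_learning_rate: float = 4e-5  # "4e-5 as the peak learning rate"
    num_epochs: int = 10              # "for 10 epochs"
    max_seq_length: int = 128         # "maximum sequence length is set to 128"
    beta1: float = 1.0                # coefficients of the distillation
    beta2: float = 1.0                # objective in Eqn. (4), all set to 1
    beta3: float = 1.0
    distill_layer: str = "last"       # self-attention knowledge of the last layer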