Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models
Authors: Dongkuan (DK) Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey, Wenhui Wang, Xiang Zhang, Ahmed Awadallah, Jianfeng Gao
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on GLUE benchmark demonstrate AutoDistil to outperform state-of-the-art KD and NAS methods with up to 41x reduction in computational cost. |
| Researcher Affiliation | Collaboration | Dongkuan Xu (NC State University, dxu27@ncsu.edu); Subhabrata Mukherjee (Microsoft Research, submukhe@microsoft.com); Xiaodong Liu (Microsoft Research); Debadeepta Dey (Microsoft Research); Wenhui Wang (Microsoft Research); Xiang Zhang (Penn State University); Ahmed Hassan Awadallah (Microsoft Research); Jianfeng Gao (Microsoft Research) |
| Pseudocode | Yes | Algorithm 1: Few-shot Task-agnostic Knowledge Distillation with AutoDistil. |
| Open Source Code | Yes | Code and models are available at aka.ms/autodistil. |
| Open Datasets | Yes | We conduct experiments on General Language Understanding Evaluation (GLUE) benchmark [30]. We use English Wikipedia and BookCorpus data for SuperLM training with WordPiece tokenization. |
| Dataset Splits | Yes | We conduct experiments on General Language Understanding Evaluation (GLUE) benchmark [30]. We compare our method with the baseline methods on two single-sentence classification tasks (CoLA [31], SST-2 [32]), two similarity and paraphrase tasks (MRPC [33], QQP [34]), and three inference tasks (MNLI [29], QNLI [35], RTE [36, 37, 38, 39]). We compute the task-agnostic self-attention distillation loss for all student subnetworks using Eqn. (4) on a held-out validation set from the unlabeled training corpus. (A data-loading sketch for these tasks follows the table.) |
| Hardware Specification | Yes | We use 16 V100 GPUs to train the SuperLM with 336 GPU-hours as the training cost. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x) are explicitly listed. |
| Experiment Setup | Yes | We use a batch size of 128 and 4e-5 as the peak learning rate for 10 epochs. The maximum sequence length is set to 128. The coefficients in the distillation objective (Eqn. (4)), β1, β2, and β3, are all set to 1. We distill the self-attention knowledge of the last layer to train the SuperLM. (A configuration sketch follows the table.) |
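
The GLUE tasks named in the Open Datasets and Dataset Splits rows are publicly available. The snippet below is a minimal sketch of how they can be fetched, assuming the Hugging Face `datasets` library (the paper does not specify a data-loading toolkit); the task identifiers are the standard GLUE configuration names, not code from the paper.

```python
# Minimal sketch: fetching the GLUE tasks listed in the Dataset Splits row.
# Assumes the Hugging Face `datasets` library; this is illustrative only and
# not the authors' data pipeline.
from datasets import load_dataset

# Standard GLUE configuration names for the tasks evaluated in the paper.
GLUE_TASKS = ["cola", "sst2", "mrpc", "qqp", "mnli", "qnli", "rte"]

for task in GLUE_TASKS:
    ds = load_dataset("glue", task)
    # MNLI exposes matched/mismatched validation splits; the other tasks
    # provide single train/validation/test splits.
    print(task, {split: len(ds[split]) for split in ds})
```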
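
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The sketch below is a hypothetical rendering: the reported values are taken verbatim, while `term1`, `term2`, and `term3` are placeholders for the three loss terms weighted by β1, β2, and β3 in Eqn. (4), whose exact definitions are not reproduced in this report.

```python
# Minimal sketch of the reported training configuration and the shape of the
# weighted distillation objective (Eqn. (4)). The three loss terms are
# hypothetical placeholders; only the reported hyperparameter values are taken
# from the paper.
from dataclasses import dataclass

@dataclass
class DistillationConfig:
    batch_size: int = 128             # reported batch size
    peak_learning_rate: float = 4e-5  # reported peak learning rate
    num_epochs: int = 10              # reported number of training epochs
    max_seq_length: int = 128         # reported maximum sequence length
    beta1: float = 1.0                # Eqn. (4) coefficients, all set to 1
    beta2: float = 1.0
    beta3: float = 1.0

def distillation_objective(term1: float, term2: float, term3: float,
                           cfg: DistillationConfig) -> float:
    """Weighted sum of the three loss terms in Eqn. (4), with β1 = β2 = β3 = 1 as reported."""
    return cfg.beta1 * term1 + cfg.beta2 * term2 + cfg.beta3 * term3
```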