MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
Authors: Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines in different parameter size of student models. We conduct extensive experiments on downstream NLP tasks. We conduct monolingual and multilingual distillation experiments in different parameter size of student models. |
| Researcher Affiliation | Industry | Wenhui Wang Furu Wei Li Dong Hangbo Bao Nan Yang Ming Zhou Microsoft Research {wenwan,fuwei,lidong1,t-habao,nanya,mingzhou}@microsoft.com |
| Pseudocode | No | The paper describes methods in prose and with diagrams, but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and models will be publicly available at https://aka.ms/minilm. |
| Open Datasets | Yes | We use documents of English Wikipedia and BookCorpus [49] for the pre-training data, following the preprocess and the WordPiece tokenization of Devlin et al. [12]. We evaluate on SQuAD 2.0 [31], which has served as a major question answering benchmark. GLUE The General Language Understanding Evaluation benchmark [44] consists of nine sentence-level classification tasks. We evaluate the student models on cross-lingual natural language inference (XNLI) benchmark [9] and cross-lingual question answering (MLQA) benchmark [24]. |
| Dataset Splits | No | The paper mentions evaluating on 'dev sets' for various benchmarks (e.g., GLUE, MNLI, SST-2, SQuAD 2.0) and using 'MLQA English development data for early stopping'. While this implies the use of validation data, the paper does not explicitly state the dataset splits (percentages or counts) used in its experiments, relying instead on the reader's knowledge of these benchmarks' standard splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment. |
| Experiment Setup | No | The paper mentions that models are 'trained using the same data and hyper-parameters' as a baseline, but does not explicitly list the specific hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) or system-level training settings used for its own experiments in the main text. |
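As the Pseudocode row notes, the paper describes deep self-attention distillation only in prose and diagrams. The sketch below is a minimal, hedged reconstruction of that objective (self-attention distribution transfer plus value-relation transfer on the teacher's last Transformer layer), assuming PyTorch-style tensors; the function names, shapes, and epsilon clamping are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def relation_kl(teacher_rel, student_rel, eps=1e-9):
    """KL divergence between two relation (attention-like) distributions.

    Both tensors: [batch, heads, seq_len, seq_len], already softmax-normalised
    over the last dimension. The result is averaged over batch, heads, and
    query positions. (Sketch; not the paper's implementation.)
    """
    kl = teacher_rel * (teacher_rel.clamp_min(eps).log()
                        - student_rel.clamp_min(eps).log())
    return kl.sum(dim=-1).mean()


def value_relation(values):
    """Scaled dot-product of a layer's value vectors with themselves.

    values: [batch, heads, seq_len, head_dim] -> [batch, heads, seq_len, seq_len]
    """
    head_dim = values.size(-1)
    scores = values @ values.transpose(-1, -2) / head_dim ** 0.5
    return F.softmax(scores, dim=-1)


def minilm_distillation_loss(teacher_attn, student_attn,
                             teacher_values, student_values):
    """Attention transfer + value-relation transfer, last layer only (a sketch)."""
    l_at = relation_kl(teacher_attn, student_attn)
    l_vr = relation_kl(value_relation(teacher_values),
                       value_relation(student_values))
    return l_at + l_vr
```

Because both terms compare [heads, seq_len, seq_len] relation matrices rather than hidden states, the student's hidden size need not match the teacher's, which is the flexibility the paper emphasizes; the number of attention heads and the sequence length do still have to agree in this sketch.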