MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Authors: Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different parameter sizes of student models. We conduct extensive experiments on downstream NLP tasks, performing monolingual and multilingual distillation with student models of different parameter sizes.
Researcher Affiliation | Industry | Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou (Microsoft Research), {wenwan,fuwei,lidong1,t-habao,nanya,mingzhou}@microsoft.com
Pseudocode | No | The paper describes its methods in prose and with diagrams, but does not contain structured pseudocode or algorithm blocks (a hedged sketch of the distillation objective is given after this table).
Open Source Code | Yes | The code and models will be publicly available at https://aka.ms/minilm.
Open Datasets | Yes | We use documents of English Wikipedia and BookCorpus [49] for the pre-training data, following the preprocessing and the WordPiece tokenization of Devlin et al. [12]. We evaluate on SQuAD 2.0 [31], which has served as a major question answering benchmark. The General Language Understanding Evaluation (GLUE) benchmark [44] consists of nine sentence-level classification tasks. We evaluate the student models on the cross-lingual natural language inference (XNLI) benchmark [9] and the cross-lingual question answering (MLQA) benchmark [24].
Dataset Splits | No | The paper mentions evaluating on the dev sets of various benchmarks (e.g., GLUE, MNLI, SST-2, SQuAD 2.0) and using the MLQA English development data for early stopping. While this implies the use of validation data, the paper does not explicitly state the splits (percentages or counts) it used, relying instead on the reader's knowledge of each benchmark's standard splits (see the dataset-loading sketch after this table).
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiments.
Experiment Setup | No | The paper mentions that models are "trained using the same data and hyper-parameters" as a baseline, but does not explicitly list the specific hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) or system-level training settings used for its own experiments in the main text.
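As flagged in the Pseudocode row, the paper presents deep self-attention distillation in prose and diagrams only. The following is a minimal PyTorch sketch, not the authors' released code (that lives at https://aka.ms/minilm): it assumes teacher and student expose the queries, keys, and values of their last self-attention layer, and all function names and tensor layouts here are our own. The objective, as the paper describes it, combines a KL divergence between teacher and student attention distributions with a KL divergence between their value relations.

    import torch
    import torch.nn.functional as F

    def kl_rows(p, q, eps=1e-9):
        # Mean KL(p || q) over all rows; p and q are attention-style
        # distributions of shape (batch, heads, seq_len, seq_len).
        return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(-1).mean()

    def relations(q, k, v, head_dim):
        # Scaled dot-product attention distributions and value relations,
        # each row-normalized with softmax as described in the paper.
        attn = F.softmax(q @ k.transpose(-1, -2) / head_dim ** 0.5, dim=-1)
        val_rel = F.softmax(v @ v.transpose(-1, -2) / head_dim ** 0.5, dim=-1)
        return attn, val_rel

    def minilm_loss(teacher_qkv, student_qkv, teacher_head_dim, student_head_dim):
        # teacher_qkv / student_qkv: (q, k, v) tensors from the LAST
        # self-attention layer, each (batch, heads, seq_len, head_dim).
        # Teacher and student must share the number of attention heads so
        # the KL terms align; head dimensions (hidden sizes) may differ.
        t_attn, t_rel = relations(*teacher_qkv, teacher_head_dim)
        s_attn, s_rel = relations(*student_qkv, student_head_dim)
        return kl_rows(t_attn, s_attn) + kl_rows(t_rel, s_rel)

Distilling only the last layer is the paper's key design choice: it removes the need for a layer-to-layer mapping, and because the value relation is a seq_len-by-seq_len matrix, the student's hidden size can differ from the teacher's without an extra projection.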
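On the Open Datasets and Dataset Splits rows: all of the evaluation data are public, but the paper leans on each benchmark's standard splits. As one way to make those splits explicit in a reproduction, the sketch below uses the third-party Hugging Face datasets library (our assumption; the paper does not name any loading tooling):

    from datasets import load_dataset  # third-party: pip install datasets

    # GLUE tasks, SQuAD 2.0, and XNLI are hosted on the Hugging Face Hub;
    # each loader returns a DatasetDict keyed by the standard splits.
    mnli = load_dataset("glue", "mnli")   # train / validation_matched / ...
    squad_v2 = load_dataset("squad_v2")   # train / validation
    xnli_en = load_dataset("xnli", "en")  # train / validation / test

    # Print the implicit split sizes that the paper leaves to the reader.
    for name, ds in [("MNLI", mnli), ("SQuAD 2.0", squad_v2), ("XNLI-en", xnli_en)]:
        print(name, {split: ds[split].num_rows for split in ds})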