Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
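The validation step mentioned above can be sketched as comparing LLM-assigned labels against a manually annotated subset, one accuracy figure per reproducibility variable. This is a hypothetical illustration with toy data; the actual pipeline, prompts, and metrics are described in [1].

```python
# Hypothetical sketch: per-variable agreement between LLM-based labels
# and manual annotations. Data and variable names are illustrative only.

def validation_accuracy(llm_labels, manual_labels):
    """Fraction of papers where the LLM label matches the manual label,
    computed separately for each reproducibility variable."""
    accuracy = {}
    for variable in manual_labels:
        pairs = zip(llm_labels[variable], manual_labels[variable])
        matches = sum(1 for llm, manual in pairs if llm == manual)
        accuracy[variable] = matches / len(manual_labels[variable])
    return accuracy

# Toy example: three papers, two variables.
llm = {"Open Source Code": ["Yes", "No", "Yes"],
       "Pseudocode": ["No", "No", "Yes"]}
manual = {"Open Source Code": ["Yes", "Yes", "Yes"],
          "Pseudocode": ["No", "No", "Yes"]}
print(validation_accuracy(llm, manual))
```

In a real pipeline the LLM labels would come from model responses like those in the table below, and accuracies this simple comparison produces are what justify treating the scores as estimates rather than ground truth.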
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
Authors: Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different parameter sizes of student models. We conduct extensive experiments on downstream NLP tasks. We conduct monolingual and multilingual distillation experiments across different parameter sizes of student models. |
| Researcher Affiliation | Industry | Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou (Microsoft Research) |
| Pseudocode | No | The paper describes methods in prose and with diagrams, but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and models will be publicly available at https://aka.ms/minilm. |
| Open Datasets | Yes | We use documents of English Wikipedia and BookCorpus [49] for the pre-training data, following the preprocessing and WordPiece tokenization of Devlin et al. [12]. We evaluate on SQuAD 2.0 [31], which has served as a major question answering benchmark. The General Language Understanding Evaluation (GLUE) benchmark [44] consists of nine sentence-level classification tasks. We evaluate the student models on the cross-lingual natural language inference (XNLI) benchmark [9] and the cross-lingual question answering (MLQA) benchmark [24]. |
| Dataset Splits | No | The paper mentions evaluating on the dev sets of various benchmarks (e.g., GLUE, MNLI, SST-2, SQuAD 2.0) and using the MLQA English development data for early stopping. While this implies the use of validation data, the paper does not explicitly state the dataset splits (percentages or counts) used in these experiments, relying instead on the reader's knowledge of each benchmark's standard splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment. |
| Experiment Setup | No | The paper mentions that models are 'trained using the same data and hyper-parameters' as a baseline, but does not explicitly list the specific hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) or system-level training settings used for its own experiments in the main text. |