MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Authors: Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different parameter sizes of student models. We conduct extensive experiments on downstream NLP tasks, performing monolingual and multilingual distillation with student models of different parameter sizes.
Researcher Affiliation | Industry | Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou (Microsoft Research), {wenwan,fuwei,lidong1,t-habao,nanya,mingzhou}@microsoft.com
Pseudocode | No | The paper describes its methods in prose and with diagrams, but does not contain structured pseudocode or algorithm blocks (a hedged sketch of the distillation objective is given after this table).
Open Source Code | Yes | The code and models will be publicly available at https://aka.ms/minilm.
Open Datasets | Yes | We use documents of English Wikipedia and BookCorpus [49] for the pre-training data, following the preprocessing and the WordPiece tokenization of Devlin et al. [12]. We evaluate on SQuAD 2.0 [31], which has served as a major question answering benchmark. The General Language Understanding Evaluation (GLUE) benchmark [44] consists of nine sentence-level classification tasks. We evaluate the student models on the cross-lingual natural language inference (XNLI) benchmark [9] and the cross-lingual question answering (MLQA) benchmark [24].
Dataset Splits | No | The paper mentions evaluating on the dev sets of various benchmarks (e.g., GLUE, MNLI, SST-2, SQuAD 2.0) and using the MLQA English development data for early stopping. While this implies the use of validation data, the paper does not explicitly state the splits (percentages or counts) it used, relying instead on the reader's knowledge of each benchmark's standard splits (see the dataset-loading sketch after this table).
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiments.
Experiment Setup | No | The paper mentions that models are "trained using the same data and hyper-parameters" as a baseline, but does not explicitly list the specific hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) or system-level training settings used for its own experiments in the main text.
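As flagged in the Pseudocode row, the paper presents deep self-attention distillation in prose and diagrams only. The following is a minimal PyTorch sketch, not the authors' released code (that lives at https://aka.ms/minilm): it assumes teacher and student expose the queries, keys, and values of their last self-attention layer, and all function names and tensor layouts here are our own. The objective, as the paper describes it, combines a KL divergence between teacher and student attention distributions with a KL divergence between their value relations.

    import torch
    import torch.nn.functional as F

    def kl_rows(p, q, eps=1e-9):
        # Mean KL(p || q) over all rows; p and q are attention-style
        # distributions of shape (batch, heads, seq_len, seq_len).
        return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(-1).mean()

    def relations(q, k, v, head_dim):
        # Scaled dot-product attention distributions and value relations,
        # each row-normalized with softmax as described in the paper.
        attn = F.softmax(q @ k.transpose(-1, -2) / head_dim ** 0.5, dim=-1)
        val_rel = F.softmax(v @ v.transpose(-1, -2) / head_dim ** 0.5, dim=-1)
        return attn, val_rel

    def minilm_loss(teacher_qkv, student_qkv, teacher_head_dim, student_head_dim):
        # teacher_qkv / student_qkv: (q, k, v) tensors from the LAST
        # self-attention layer, each (batch, heads, seq_len, head_dim).
        # Teacher and student must share the number of attention heads so
        # the KL terms align; head dimensions (hidden sizes) may differ.
        t_attn, t_rel = relations(*teacher_qkv, teacher_head_dim)
        s_attn, s_rel = relations(*student_qkv, student_head_dim)
        return kl_rows(t_attn, s_attn) + kl_rows(t_rel, s_rel)

Distilling only the last layer is the paper's key design choice: it removes the need for a layer-to-layer mapping, and because the value relation is a seq_len-by-seq_len matrix, the student's hidden size can differ from the teacher's without an extra projection.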
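On the Open Datasets and Dataset Splits rows: all of the evaluation data are public, but the paper leans on each benchmark's standard splits. As one way to make those splits explicit in a reproduction, the sketch below uses the third-party Hugging Face datasets library (our assumption; the paper does not name any loading tooling):

    from datasets import load_dataset  # third-party: pip install datasets

    # GLUE tasks, SQuAD 2.0, and XNLI are hosted on the Hugging Face Hub;
    # each loader returns a DatasetDict keyed by the standard splits.
    mnli = load_dataset("glue", "mnli")   # train / validation_matched / ...
    squad_v2 = load_dataset("squad_v2")   # train / validation
    xnli_en = load_dataset("xnli", "en")  # train / validation / test

    # Print the implicit split sizes that the paper leaves to the reader.
    for name, ds in [("MNLI", mnli), ("SQuAD 2.0", squad_v2), ("XNLI-en", xnli_en)]:
        print(name, {split: ds[split].num_rows for split in ds})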