Reducing Transformer Depth on Demand with Structured Dropout

Authors: Angela Fan, Edouard Grave, Armand Joulin

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks."
Researcher Affiliation | Collaboration | Angela Fan (Facebook AI Research / LORIA, angelafan@fb.com), Edouard Grave (Facebook AI Research, egrave@fb.com), Armand Joulin (Facebook AI Research, ajoulin@fb.com)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our models are implemented in PyTorch using fairseq-py (Ott et al., 2019)." Footnote: https://github.com/pytorch/fairseq/tree/master/examples/layerdrop
Open Datasets | Yes | "We validate our findings on a variety of competitive benchmarks, namely WMT14 English-German for machine translation, WikiText-103 (Merity et al., 2016) for language modeling, CNN-Dailymail (Hermann et al., 2015) for abstractive summarization, ELI5 (Fan et al., 2017) for long form question answering, and several natural language understanding tasks (Wang et al., 2019a) for sentence representation."
Dataset Splits | Yes | "We experiment on the WMT English-German machine translation benchmark using the Transformer Big architecture. We use the dataset of 4.5M en-de sentence pairs from WMT16 (Vaswani et al., 2017) for training, newstest2013 for validation, and newstest2014 for test."
Hardware Specification | Yes | "The words per second were computed on 8 V100 GPUs with 32GB of memory, without floating point 16, for a 16 layer model trained on WikiText-103."
Software Dependencies | No | The paper mentions "PyTorch using fairseq-py (Ott et al., 2019)" and "Adam" as software used, but it does not specify version numbers for any of these components.
Experiment Setup | Yes | Table 5 gives hyperparameters for RoBERTa pretraining (number of layers, hidden size, FFN size, attention heads, LayerDrop, warmup steps, peak learning rate, batch size). Also: "We optimize the dropout value within the range {0.1, 0.2, 0.5} on the validation set and set the LayerDrop rate p to 0.2."
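The LayerDrop rate p = 0.2 quoted in the last row refers to the paper's structured dropout over whole Transformer layers. Below is a minimal, self-contained PyTorch sketch of that idea, assuming a generic encoder stack; the class name LayerDropEncoder and its default parameters are illustrative and are not taken from the authors' fairseq implementation linked above.

```python
# Minimal sketch of LayerDrop (structured dropout over whole Transformer layers),
# assuming a generic encoder stack. Names (LayerDropEncoder, layerdrop) are
# illustrative; this is not the authors' fairseq implementation.
import torch
import torch.nn as nn


class LayerDropEncoder(nn.Module):
    def __init__(self, num_layers=16, d_model=512, nhead=8, layerdrop=0.2):
        super().__init__()
        self.layerdrop = layerdrop  # p = 0.2, the rate quoted in the table above
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
             for _ in range(num_layers)]
        )

    def forward(self, x):
        for layer in self.layers:
            # Training: skip each layer independently with probability p.
            # Inference: keep every layer, or prune a fixed subset "on demand".
            if self.training and torch.rand(1).item() < self.layerdrop:
                continue
            x = layer(x)
        return x


# Usage with the default (seq_len, batch, d_model) layout of TransformerEncoderLayer.
model = LayerDropEncoder()
out = model(torch.randn(10, 2, 512))  # -> shape (10, 2, 512)
```

At inference time, the paper's "on demand" depth reduction corresponds to removing entire layers from a trained stack (for example, keeping every other layer) without retraining; in this sketch that would amount to simply dropping entries from self.layers.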