Reducing Transformer Depth on Demand with Structured Dropout

Authors: Angela Fan, Edouard Grave, Armand Joulin

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks."
Researcher Affiliation | Collaboration | Angela Fan (Facebook AI Research / LORIA, angelafan@fb.com), Edouard Grave (Facebook AI Research, egrave@fb.com), Armand Joulin (Facebook AI Research, ajoulin@fb.com)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our models are implemented in PyTorch using fairseq-py (Ott et al., 2019)." Footnote: https://github.com/pytorch/fairseq/tree/master/examples/layerdrop
Open Datasets | Yes | "We validate our findings on a variety of competitive benchmarks, namely WMT14 English-German for machine translation, WikiText-103 (Merity et al., 2016) for language modeling, CNN-Dailymail (Hermann et al., 2015) for abstractive summarization, ELI5 (Fan et al., 2017) for long form question answering, and several natural language understanding tasks (Wang et al., 2019a) for sentence representation."
Dataset Splits | Yes | "We experiment on the WMT English-German machine translation benchmark using the Transformer Big architecture. We use the dataset of 4.5M en-de sentence pairs from WMT16 (Vaswani et al., 2017) for training, newstest2013 for validation, and newstest2014 for test."
Hardware Specification | Yes | "The words per second were computed on 8 V100 GPUs with 32GB of memory, without floating point 16, for a 16 layer model trained on WikiText-103."
Software Dependencies | No | The paper mentions "PyTorch using fairseq-py (Ott et al., 2019)" and "Adam" as software used, but it does not specify version numbers for any of these components.
Experiment Setup | Yes | Table 5 gives hyperparameters for RoBERTa pretraining (number of layers, hidden size, FFN size, attention heads, LayerDrop, warmup steps, peak learning rate, batch size). Also: "We optimize the dropout value within the range {0.1, 0.2, 0.5} on the validation set and set the LayerDrop rate p to 0.2."
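The LayerDrop rate p = 0.2 quoted in the last row refers to the paper's structured dropout over whole Transformer layers. Below is a minimal, self-contained PyTorch sketch of that idea, assuming a generic encoder stack; the class name LayerDropEncoder and its default parameters are illustrative and are not taken from the authors' fairseq implementation linked above.

```python
# Minimal sketch of LayerDrop (structured dropout over whole Transformer layers),
# assuming a generic encoder stack. Names (LayerDropEncoder, layerdrop) are
# illustrative; this is not the authors' fairseq implementation.
import torch
import torch.nn as nn


class LayerDropEncoder(nn.Module):
    def __init__(self, num_layers=16, d_model=512, nhead=8, layerdrop=0.2):
        super().__init__()
        self.layerdrop = layerdrop  # p = 0.2, the rate quoted in the table above
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
             for _ in range(num_layers)]
        )

    def forward(self, x):
        for layer in self.layers:
            # Training: skip each layer independently with probability p.
            # Inference: keep every layer, or prune a fixed subset "on demand".
            if self.training and torch.rand(1).item() < self.layerdrop:
                continue
            x = layer(x)
        return x


# Usage with the default (seq_len, batch, d_model) layout of TransformerEncoderLayer.
model = LayerDropEncoder()
out = model(torch.randn(10, 2, 512))  # -> shape (10, 2, 512)
```

At inference time, the paper's "on demand" depth reduction corresponds to removing entire layers from a trained stack (for example, keeping every other layer) without retraining; in this sketch that would amount to simply dropping entries from self.layers.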