Reducing Transformer Depth on Demand with Structured Dropout
Authors: Angela Fan, Edouard Grave, Armand Joulin
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks. |
| Researcher Affiliation | Collaboration | Angela Fan Facebook AI Research/LORIA angelafan@fb.com Edouard Grave Facebook AI Research egrave@fb.com Armand Joulin Facebook AI Research ajoulin@fb.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our models are implemented in PyTorch using fairseq-py (Ott et al., 2019): https://github.com/pytorch/fairseq/tree/master/examples/layerdrop |
| Open Datasets | Yes | We validate our findings on a variety of competitive benchmarks, namely WMT14 English-German for machine translation, WikiText-103 (Merity et al., 2016) for language modeling, CNN-Dailymail (Hermann et al., 2015) for abstractive summarization, ELI5 (Fan et al., 2017) for long form question answering, and several natural language understanding tasks (Wang et al., 2019a) for sentence representation. |
| Dataset Splits | Yes | We experiment on the WMT English-German machine translation benchmark using the Transformer Big architecture. We use the dataset of 4.5M en-de sentence pairs from WMT16 (Vaswani et al., 2017) for training, newstest2013 for validation, and newstest2014 for test. |
| Hardware Specification | Yes | The words per second were computed on 8 V100 GPUs with 32GB of memory, without floating point 16, for a 16 layer model trained on Wikitext-103. |
| Software Dependencies | No | The paper mentions 'PyTorch using fairseq-py (Ott et al., 2019)' and 'Adam' as software used, but it does not specify version numbers for any of these components. |
| Experiment Setup | Yes | Table 5: Hyperparameters for RoBERTa Pretraining (Number of Layers, Hidden Size, FFN Size, Attention Heads, LayerDrop, Warmup Steps, Peak Learning Rate, Batch Size). Also, 'We optimize the dropout value within the range {0.1, 0.2, 0.5} on the validation set and set the LayerDrop rate p to 0.2.' A minimal LayerDrop sketch follows this table. |
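
To make the mechanism the table refers to concrete, below is a minimal PyTorch sketch of LayerDrop: structured dropout that skips entire Transformer layers during training with rate p = 0.2, as reported in the paper. This is not the authors' fairseq implementation; the class name `LayerDropEncoder`, the layer count, and the model dimensions are illustrative assumptions.

```python
# Minimal LayerDrop sketch (assumed setup, not the fairseq code linked above).
import torch
import torch.nn as nn


class LayerDropEncoder(nn.Module):
    """Stack of Transformer encoder layers with structured (per-layer) dropout."""

    def __init__(self, num_layers=16, d_model=512, nhead=8, p_layerdrop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.p = p_layerdrop

    def forward(self, x):
        for layer in self.layers:
            # During training, skip each layer independently with probability p.
            # At inference, all layers (or a pruned sub-network) are applied.
            if self.training and torch.rand(()) < self.p:
                continue
            x = layer(x)
        return x


# Usage: a batch of 2 sequences of length 10 with hidden size 512.
encoder = LayerDropEncoder()
out = encoder(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

The official implementation is available in the fairseq repository linked in the Open Source Code row above.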