R-Drop: Regularized Dropout for Neural Networks
Authors: Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, Tie-Yan Liu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 5 widely used deep learning tasks (18 datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. |
| Researcher Affiliation | Collaboration | 1Soochow University, 2Microsoft Research Asia |
| Pseudocode | Yes | Algorithm 1 R-Drop Training Algorithm |
| Open Source Code | Yes | Our code is available at GitHub: https://github.com/dropreg/R-Drop |
| Open Datasets | Yes | Datasets The datasets of low-resource scenario are from IWSLT competitions, which include IWSLT14 English→German (En→De), English→Spanish (En→Es), and IWSLT17 English→French (En→Fr), English→Chinese (En→Zh) translations. The rich-resource datasets come from the widely acknowledged WMT translation tasks, and we take the WMT14 English→German and English→French tasks. The GLUE [61] benchmark... CNN/Daily Mail dataset originally introduced by Hermann et al. [22]... Wikitext-103 dataset [41]... CIFAR-100 [31] and the ILSVRC-2012 ImageNet dataset [8]. |
| Dataset Splits | Yes | The IWSLT datasets contain about 170k training sentence pairs, 7k valid pairs, and 7k test pairs. The WMT data sizes are 4.5M, 36M for En→De and En→Fr respectively; valid and test data are from the corresponding newstest data. The CNN/Daily Mail dataset contains 287,226 documents for training, 13,368 documents for validation and 11,490 documents for test. Same as [5], we report the perplexity on both valid and test sets. CIFAR-100 dataset consists of 60k images of 100 classes, and there are 600 images per class with 500 for training and 100 for testing. |
| Hardware Specification | No | The paper states, 'We provide the details in Appendix A.' (Question 3d in the checklist); however, Appendix A is not included in the provided text. |
| Software Dependencies | No | The paper mentions using 'Fairseq [48]' but does not specify a version number for this or any other software dependency. |
| Experiment Setup | Yes | The weight α is set as 5 for all translation tasks. For each task, different random seeds and parameter settings are required, thus we dynamically adjust the coefficient α among {0.1, 0.5, 1.0} for each setting. In this task, the coefficient weight α is set as 0.7 to control the KL-divergence. We simply set the weight α to be 1.0 without tuning during training. During fine-tuning, the weight α is set as 0.6 for both models. We vary k in {1, 2, 5, 10}. Here we vary the α in {1, 3, 5, 7, 10} and conduct experiments. |
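The pseudocode entry above (Algorithm 1, R-Drop Training) boils down to running each batch through the model twice with independent dropout masks and minimizing cross-entropy plus an α-weighted symmetric KL term between the two predicted distributions. A minimal NumPy sketch of that loss is below; the single-layer toy model, shapes, and default hyperparameters are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def stochastic_forward(x, W, drop_p):
    # One forward pass with a fresh (inverted) dropout mask on the input.
    mask = rng.random(x.shape) >= drop_p
    h = (x * mask) / (1.0 - drop_p)
    return softmax(h @ W)  # toy single-layer classifier (illustrative)

def r_drop_loss(x, y, W, alpha=5.0, drop_p=0.3):
    # Two passes over the same batch -> two distributions p1, p2.
    p1 = stochastic_forward(x, W, drop_p)
    p2 = stochastic_forward(x, W, drop_p)
    n = np.arange(len(y))
    # Cross-entropy on both passes (negative log-likelihood of gold labels).
    ce = -(np.log(p1[n, y]) + np.log(p2[n, y])).mean()
    # Bidirectional KL divergence between the two predicted distributions.
    kl12 = (p1 * (np.log(p1) - np.log(p2))).sum(-1).mean()
    kl21 = (p2 * (np.log(p2) - np.log(p1))).sum(-1).mean()
    return ce + alpha * 0.5 * (kl12 + kl21)
```

With `alpha=0` this reduces to ordinary two-pass cross-entropy; the α values quoted in the Experiment Setup row (e.g. 5 for translation, 0.6–1.0 for fine-tuning) weight the KL regularizer per task.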