Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers

Authors: Rui Liu, Young Jin Kim, Alexandre Muzio, Hany Hassan

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of Gating Dropout on multilingual machine translation tasks. Our results demonstrate that Gating Dropout improves a state-of-the-art MoE model (Kim et al., 2021) with faster wall-clock time convergence rates and better BLEU scores for a variety of model sizes and datasets.
Researcher Affiliation | Collaboration | University of Michigan, Ann Arbor; Microsoft.
Pseudocode | No | The paper describes the proposed Gating Dropout technique in prose, but it does not include any structured pseudocode or algorithm blocks (an illustrative sketch is given below the table).
Open Source Code | Yes | Our code will be available at https://aka.ms/gating_dropout.
Open Datasets | Yes | WMT-10: We use the WMT-10 benchmark dataset, which includes bitext data between English and 10 other languages (Wang et al., 2020). [...] Web-50: We use a dataset composed of an in-house web-crawled parallel dataset augmented with data from CCNet (Wenzek et al., 2019).
Dataset Splits | Yes | Generalization Performance: BLEU score on a holdout validation set.
Hardware Specification | Yes | We use a cluster of NVIDIA V100 GPUs connected via a 100Gb/s InfiniBand fabric in most experiments. For some experiments on the Web-50 dataset, we run jobs on a cluster of NVIDIA A100 GPUs connected via a 1.6Tb/s InfiniBand fabric.
Software Dependencies | No | The paper mentions that 'All of our implementations are based on DeepSpeed library', but it does not specify any version numbers for DeepSpeed or any other software dependencies.
Experiment Setup | Yes | The capacity factor is set to 1.0 during training and 2.0 during testing (Fedus et al., 2021). Jittering noise is applied to the token representation right before the gating network. An additional balancing loss with multiplicative coefficient 0.01 is used to better balance the utilization among different experts. [...] We use Adam as the optimizer with β1 = 0.9 and β2 = 0.99. The learning rate is set to 0.03 with 5000 warm-up steps and an inverse square root scheduler as proposed in Raffel et al. (2019). We set the batch size to be equivalent to 435k tokens. For the gating dropout rate p, we set it to 0.3 for Gate-Drop and 0.2 for Gate-Expert-Drop by default, which are chosen because of their good performance (see Section 4.4). (A configuration sketch follows the table.)
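
Since the paper gives no pseudocode, the following is a minimal PyTorch-style sketch of how Gating Dropout could wrap a sparsely activated layer, based only on the paper's prose description: with probability p during training, the gating decision (and hence the all-to-all dispatch) is skipped and tokens stay on the local device. The class name, the `moe_forward` and `local_expert` callables, and the `drop_expert` flag are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GatingDropout(nn.Module):
    """Illustrative wrapper applying Gating Dropout to a sparse MoE layer.

    moe_forward:  the layer's usual gated forward pass
                  (gate -> all-to-all dispatch -> expert FFNs -> all-to-all combine).
    local_expert: the expert FFN resident on the current device.
    """

    def __init__(self, moe_forward, local_expert, p=0.3, drop_expert=False):
        super().__init__()
        self.moe_forward = moe_forward
        self.local_expert = local_expert
        self.p = p                      # gating dropout rate
        self.drop_expert = drop_expert  # True => Gate-Expert-Drop variant

    def forward(self, tokens):
        # With probability p during training, skip gating and therefore the
        # cross-device all-to-all communication for this step.
        if self.training and torch.rand(()).item() < self.p:
            if self.drop_expert:
                return tokens                 # Gate-Expert-Drop: identity / skip
            return self.local_expert(tokens)  # Gate-Drop: route to the local expert
        return self.moe_forward(tokens)       # normal sparse routing with all-to-all
```

In a real distributed MoE implementation the dropout decision would likely need to be made consistently across workers (e.g., via a shared random seed), since all ranks must agree on whether the all-to-all collective is issued.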
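
To make the reported setup concrete, here is a small Python sketch that collects the stated hyperparameters and the learning-rate rule they imply. The paper does not spell out the exact scheduler formula, so the function below follows one common inverse-square-root convention (linear warm-up to the peak rate, then 1/sqrt(step) decay); the dictionary keys are illustrative names, not configuration flags from the authors' code.

```python
import math

# Hyperparameters as reported in the Experiment Setup row above.
config = {
    "optimizer": "Adam",
    "adam_beta1": 0.9,
    "adam_beta2": 0.99,
    "peak_lr": 0.03,
    "warmup_steps": 5000,
    "lr_scheduler": "inverse_sqrt",
    "batch_size_tokens": 435_000,
    "capacity_factor_train": 1.0,
    "capacity_factor_test": 2.0,
    "balancing_loss_coeff": 0.01,
    "gating_dropout_p": {"Gate-Drop": 0.3, "Gate-Expert-Drop": 0.2},
}

def inverse_sqrt_lr(step, peak_lr=0.03, warmup_steps=5000):
    """One common inverse-square-root schedule with linear warm-up
    (assumed here; the paper only names the scheduler)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps         # linear warm-up to the peak
    return peak_lr * math.sqrt(warmup_steps / step)  # 1/sqrt(step) decay afterwards

# Example: learning rate at the end of warm-up and after 100k steps.
print(inverse_sqrt_lr(5_000, config["peak_lr"], config["warmup_steps"]))    # 0.03
print(inverse_sqrt_lr(100_000, config["peak_lr"], config["warmup_steps"]))  # ~0.0067
```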