Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers

Authors: Rui Liu, Young Jin Kim, Alexandre Muzio, Hany Hassan

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of Gating Dropout on multilingual machine translation tasks. Our results demonstrate that Gating Dropout improves a state-of-the-art MoE model (Kim et al., 2021) with faster wall-clock time convergence rates and better BLEU scores for a variety of model sizes and datasets.
Researcher Affiliation | Collaboration | University of Michigan, Ann Arbor; Microsoft.
Pseudocode | No | The paper describes the proposed Gating Dropout technique in prose, but it does not include any structured pseudocode or algorithm blocks (an illustrative sketch is given below the table).
Open Source Code | Yes | Our code will be available at https://aka.ms/gating_dropout.
Open Datasets | Yes | WMT-10: We use the WMT-10 benchmark dataset, which includes bitext data between English and 10 other languages (Wang et al., 2020). [...] Web-50: We use a dataset composed of an in-house web-crawled parallel dataset augmented with data from CCNet (Wenzek et al., 2019).
Dataset Splits | Yes | Generalization Performance: BLEU score on a holdout validation set.
Hardware Specification | Yes | We use a cluster of NVIDIA V100 GPUs connected via a 100Gb/s InfiniBand fabric in most experiments. For some experiments on the Web-50 dataset, we run jobs on a cluster of NVIDIA A100 GPUs connected via a 1.6Tb/s InfiniBand fabric.
Software Dependencies | No | The paper mentions that 'All of our implementations are based on DeepSpeed library', but it does not specify any version numbers for DeepSpeed or any other software dependencies.
Experiment Setup | Yes | The capacity factor is set to 1.0 during training and 2.0 during testing (Fedus et al., 2021). Jittering noise is applied to the token representation right before the gating network. An additional balancing loss with multiplicative coefficient 0.01 is used to better balance the utilization among different experts. [...] We use Adam as the optimizer with β1 = 0.9 and β2 = 0.99. The learning rate is set to 0.03 with 5000 warm-up steps and an inverse square root scheduler as proposed in Raffel et al. (2019). We set the batch size to be equivalent to 435k tokens. For the gating dropout rate p, we set it to 0.3 for Gate-Drop and 0.2 for Gate-Expert-Drop by default, which are chosen because of their good performance (see Section 4.4). (A configuration sketch follows the table.)
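
Since the paper gives no pseudocode, the following is a minimal PyTorch-style sketch of how Gating Dropout could wrap a sparsely activated layer, based only on the paper's prose description: with probability p during training, the gating decision (and hence the all-to-all dispatch) is skipped and tokens stay on the local device. The class name, the `moe_forward` and `local_expert` callables, and the `drop_expert` flag are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GatingDropout(nn.Module):
    """Illustrative wrapper applying Gating Dropout to a sparse MoE layer.

    moe_forward:  the layer's usual gated forward pass
                  (gate -> all-to-all dispatch -> expert FFNs -> all-to-all combine).
    local_expert: the expert FFN resident on the current device.
    """

    def __init__(self, moe_forward, local_expert, p=0.3, drop_expert=False):
        super().__init__()
        self.moe_forward = moe_forward
        self.local_expert = local_expert
        self.p = p                      # gating dropout rate
        self.drop_expert = drop_expert  # True => Gate-Expert-Drop variant

    def forward(self, tokens):
        # With probability p during training, skip gating and therefore the
        # cross-device all-to-all communication for this step.
        if self.training and torch.rand(()).item() < self.p:
            if self.drop_expert:
                return tokens                 # Gate-Expert-Drop: identity / skip
            return self.local_expert(tokens)  # Gate-Drop: route to the local expert
        return self.moe_forward(tokens)       # normal sparse routing with all-to-all
```

In a real distributed MoE implementation the dropout decision would likely need to be made consistently across workers (e.g., via a shared random seed), since all ranks must agree on whether the all-to-all collective is issued.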
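
To make the reported setup concrete, here is a small Python sketch that collects the stated hyperparameters and the learning-rate rule they imply. The paper does not spell out the exact scheduler formula, so the function below follows one common inverse-square-root convention (linear warm-up to the peak rate, then 1/sqrt(step) decay); the dictionary keys are illustrative names, not configuration flags from the authors' code.

```python
import math

# Hyperparameters as reported in the Experiment Setup row above.
config = {
    "optimizer": "Adam",
    "adam_beta1": 0.9,
    "adam_beta2": 0.99,
    "peak_lr": 0.03,
    "warmup_steps": 5000,
    "lr_scheduler": "inverse_sqrt",
    "batch_size_tokens": 435_000,
    "capacity_factor_train": 1.0,
    "capacity_factor_test": 2.0,
    "balancing_loss_coeff": 0.01,
    "gating_dropout_p": {"Gate-Drop": 0.3, "Gate-Expert-Drop": 0.2},
}

def inverse_sqrt_lr(step, peak_lr=0.03, warmup_steps=5000):
    """One common inverse-square-root schedule with linear warm-up
    (assumed here; the paper only names the scheduler)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps         # linear warm-up to the peak
    return peak_lr * math.sqrt(warmup_steps / step)  # 1/sqrt(step) decay afterwards

# Example: learning rate at the end of warm-up and after 100k steps.
print(inverse_sqrt_lr(5_000, config["peak_lr"], config["warmup_steps"]))    # 0.03
print(inverse_sqrt_lr(100_000, config["peak_lr"], config["warmup_steps"]))  # ~0.0067
```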