PowerNorm: Rethinking Batch Normalization in Transformers
Authors: Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, Kurt Keutzer
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively test PN for transformers on a range of NLP tasks, and we show that it significantly outperforms both LN and BN. In particular, PN outperforms LN by 0.4/0.6 BLEU on IWSLT14/WMT14 and 5.6/3.0 PPL on PTB/WikiText-103. |
| Researcher Affiliation | Academia | 1UC Berkeley. Correspondence to: Amir Gholami <amirgh@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1 Batch Normalization (Every Iteration) and Algorithm 2 Power Normalization (Every Iteration). A simplified Power Normalization sketch follows the table. |
| Open Source Code | Yes | We make our code publicly available at https://github.com/sIncerass/powernorm. |
| Open Datasets | Yes | We evaluate our methods on two widely used public datasets: IWSLT14 German-to-English (De-En) and WMT14 English-to-German (En-De) dataset. ... We experiment on both PTB (Mikolov et al., 2011) and WikiText-103 (Merity et al., 2017) |
| Dataset Splits | Yes | The training/validation/test sets for the IWSLT14 dataset contain about 153K/7K/7K sentence pairs, respectively. ... Newstest2014 is used as the test set, and Newstest2013 is used as the validation set. ... PTB (Mikolov et al., 2011) has 0.93M training tokens, 0.073M validation words, and 0.082M test words. |
| Hardware Specification | No | The paper acknowledges support from 'Google Cloud' and 'Amazon AWS' but does not specify any particular CPU, GPU, or other hardware models used for the experiments. |
| Software Dependencies | No | We implement our code for MT using fairseq-py (Ott et al., 2019), and (Ma et al., 2019) for LM tasks. ... We use the Adam optimizer. The paper mentions software tools like 'fairseq-py' and 'Adam optimizer' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For all the experiments, we use the pre-normalization setting in (Wang et al., 2019), where the normalization layer is located right before the multi-head attention module and point-wise feed-forward network module (see the pre-normalization layout sketch after the table). Following (Wang et al., 2019), we generally increase the learning rate by a factor of 2.0, relative to the common post-normalization transformer (Vaswani et al., 2017). ... All the other hyperparameters (learning rate, dropout, weight decay, warmup steps, etc.) are set identically to the ones reported in the literature for LN (i.e., we use the same hyperparameters for BN/PN). ... We set dropout as 0.3/0.0 for Transformer big/small model, respectively. We use the Adam optimizer and follow the optimizer setting and learning rate schedule in (Wang et al., 2019). We set the maximum number of updates following (Ott et al., 2018) to be 300k for WMT and 100k for IWSLT. ... We employ label smoothing of value ε_ls = 0.1 in all experiments. |
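
To make the Pseudocode row concrete, here is a minimal, hedged sketch of the power-normalization idea: normalize by a running quadratic mean rather than by batch mean and variance. The class name `PowerNormSketch`, the moving-average factor `alpha`, and the placement of the running-statistics update are illustrative assumptions; the authors' actual implementation (including the approximate backward pass through running statistics) is in the linked repository.

```python
import torch
import torch.nn as nn


class PowerNormSketch(nn.Module):
    """Simplified sketch of Power Normalization: scale activations by a
    running quadratic mean (psi^2) per feature, with no mean subtraction.
    This omits the paper's approximate backpropagation through the
    running statistics."""

    def __init__(self, num_features, eps=1e-5, alpha=0.9):
        super().__init__()
        self.eps = eps
        self.alpha = alpha  # moving-average factor (assumed value)
        self.weight = nn.Parameter(torch.ones(num_features))   # gamma
        self.bias = nn.Parameter(torch.zeros(num_features))    # beta
        self.register_buffer("running_psi2", torch.ones(num_features))

    def forward(self, x):
        # x: (batch, seq_len, num_features); statistics are taken over
        # both the batch and time dimensions, per feature.
        if self.training:
            psi2 = x.pow(2).mean(dim=(0, 1))  # quadratic mean per feature
            with torch.no_grad():
                self.running_psi2.mul_(self.alpha).add_(
                    (1.0 - self.alpha) * psi2.detach()
                )
        else:
            psi2 = self.running_psi2
        x_hat = x / torch.sqrt(psi2 + self.eps)
        return self.weight * x_hat + self.bias
```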
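
The Experiment Setup row refers to the pre-normalization layout of Wang et al. (2019), with the normalization layer placed right before the multi-head attention and point-wise feed-forward modules. The sketch below shows that layout only; the model dimensions, dropout value, and the use of `nn.LayerNorm`/`nn.MultiheadAttention` are illustrative assumptions, and any of LN/BN/PN could be passed as `norm_layer`.

```python
import torch
import torch.nn as nn


class PreNormEncoderLayer(nn.Module):
    """Minimal sketch of a pre-normalization transformer encoder layer:
    normalization is applied *before* each sub-layer, with residual
    connections around the self-attention and feed-forward blocks."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.3,
                 norm_layer=nn.LayerNorm):
        super().__init__()
        self.norm1 = norm_layer(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.norm2 = norm_layer(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (seq_len, batch, d_model)
        h = self.norm1(x)                      # normalize before attention
        attn_out, _ = self.self_attn(h, h, h)
        x = x + self.dropout(attn_out)         # residual connection
        h = self.norm2(x)                      # normalize before the FFN
        x = x + self.dropout(self.ffn(h))
        return x
```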