PowerNorm: Rethinking Batch Normalization in Transformers
Authors: Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, Kurt Keutzer
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively test PN for transformers on a range of NLP tasks, and we show that it significantly outperforms both LN and BN. In particular, PN outperforms LN by 0.4/0.6 BLEU on IWSLT14/WMT14 and 5.6/3.0 PPL on PTB/WikiText-103. |
| Researcher Affiliation | Academia | 1UC Berkeley. Correspondence to: Amir Gholami <amirgh@berkeley.edu>. |
| Pseudocode | Yes | Algorithm 1 Batch Normalization (Every Iteration) and Algorithm 2 Power Normalization (Every Iteration). A simplified Power Normalization sketch follows the table. |
| Open Source Code | Yes | We make our code publicly available at https://github.com/sIncerass/powernorm. |
| Open Datasets | Yes | We evaluate our methods on two widely used public datasets: IWSLT14 German-to-English (De-En) and WMT14 English-to-German (En-De) dataset. ... We experiment on both PTB (Mikolov et al., 2011) and WikiText-103 (Merity et al., 2017) |
| Dataset Splits | Yes | The training/validation/test sets for the IWSLT14 dataset contain about 153K/7K/7K sentence pairs, respectively. ... Newstest2014 is used as the test set, and Newstest2013 is used as the validation set. ... PTB (Mikolov et al., 2011) has 0.93M training tokens, 0.073M validation words, and 0.082M test words. |
| Hardware Specification | No | The paper acknowledges support from 'Google Cloud' and 'Amazon AWS' but does not specify any particular CPU, GPU, or other hardware models used for the experiments. |
| Software Dependencies | No | We implement our code for MT using fairseq-py (Ott et al., 2019), and (Ma et al., 2019) for LM tasks. ... We use the Adam optimizer. The paper mentions software tools like 'fairseq-py' and 'Adam optimizer' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For all the experiments, we use the pre-normalization setting in (Wang et al., 2019), where the normalization layer is located right before the multi-head attention module and point-wise feed-forward network module (see the pre-normalization layout sketch after the table). Following (Wang et al., 2019), we generally increase the learning rate by a factor of 2.0, relative to the common post-normalization transformer (Vaswani et al., 2017). ... All the other hyperparameters (learning rate, dropout, weight decay, warmup steps, etc.) are set identically to the ones reported in the literature for LN (i.e., we use the same hyperparameters for BN/PN). ... We set dropout as 0.3/0.0 for Transformer big/small model, respectively. We use the Adam optimizer and follow the optimizer setting and learning rate schedule in (Wang et al., 2019). We set the maximum number of updates following (Ott et al., 2018) to be 300k for WMT and 100k for IWSLT. ... We employ label smoothing of value ε_ls = 0.1 in all experiments. |
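
To make the Pseudocode row concrete, here is a minimal, hedged sketch of the power-normalization idea: normalize by a running quadratic mean rather than by batch mean and variance. The class name `PowerNormSketch`, the moving-average factor `alpha`, and the placement of the running-statistics update are illustrative assumptions; the authors' actual implementation (including the approximate backward pass through running statistics) is in the linked repository.

```python
import torch
import torch.nn as nn


class PowerNormSketch(nn.Module):
    """Simplified sketch of Power Normalization: scale activations by a
    running quadratic mean (psi^2) per feature, with no mean subtraction.
    This omits the paper's approximate backpropagation through the
    running statistics."""

    def __init__(self, num_features, eps=1e-5, alpha=0.9):
        super().__init__()
        self.eps = eps
        self.alpha = alpha  # moving-average factor (assumed value)
        self.weight = nn.Parameter(torch.ones(num_features))   # gamma
        self.bias = nn.Parameter(torch.zeros(num_features))    # beta
        self.register_buffer("running_psi2", torch.ones(num_features))

    def forward(self, x):
        # x: (batch, seq_len, num_features); statistics are taken over
        # both the batch and time dimensions, per feature.
        if self.training:
            psi2 = x.pow(2).mean(dim=(0, 1))  # quadratic mean per feature
            with torch.no_grad():
                self.running_psi2.mul_(self.alpha).add_(
                    (1.0 - self.alpha) * psi2.detach()
                )
        else:
            psi2 = self.running_psi2
        x_hat = x / torch.sqrt(psi2 + self.eps)
        return self.weight * x_hat + self.bias
```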
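
The Experiment Setup row refers to the pre-normalization layout of Wang et al. (2019), with the normalization layer placed right before the multi-head attention and point-wise feed-forward modules. The sketch below shows that layout only; the model dimensions, dropout value, and the use of `nn.LayerNorm`/`nn.MultiheadAttention` are illustrative assumptions, and any of LN/BN/PN could be passed as `norm_layer`.

```python
import torch
import torch.nn as nn


class PreNormEncoderLayer(nn.Module):
    """Minimal sketch of a pre-normalization transformer encoder layer:
    normalization is applied *before* each sub-layer, with residual
    connections around the self-attention and feed-forward blocks."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.3,
                 norm_layer=nn.LayerNorm):
        super().__init__()
        self.norm1 = norm_layer(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.norm2 = norm_layer(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (seq_len, batch, d_model)
        h = self.norm1(x)                      # normalize before attention
        attn_out, _ = self.self_attn(h, h, h)
        x = x + self.dropout(attn_out)         # residual connection
        h = self.norm2(x)                      # normalize before the FFN
        x = x + self.dropout(self.ffn(h))
        return x
```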