Understanding and Improving Layer Normalization

Authors: Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, Junyang Lin

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains the state-of-the-art performance on En-Vi machine translation. To investigate how LayerNorm works, we conduct a series of experiments in this paper. (A sketch of the LayerNorm-simple variant follows the table.)
Researcher Affiliation | Academia | Jingjing Xu¹, Xu Sun¹ ², Zhiyuan Zhang¹, Guangxiang Zhao², Junyang Lin¹. ¹MOE Key Lab of Computational Linguistics, School of EECS, Peking University; ²Center for Data Science, Peking University. {jingjingxu,xusun,zzy1210,zhaoguangxiang,linjunyang}@pku.edu.cn
Pseudocode | No | The paper describes algorithms through mathematical equations (e.g., Eq. 1, 2, 4, 6, 9) but does not provide any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our code is released at https://github.com/lancopku/AdaNorm
Open Datasets | Yes | Machine translation includes three widely-used datasets, WMT English-German (En-De), IWSLT 14 German-English (De-En) [Cettolo et al., 2014] and IWSLT 15 English-Vietnamese (En-Vi) [Cettolo et al., 2015]. Language modeling includes a large dataset, Enwiki8, that contains 100M bytes of unprocessed Wikipedia text (http://mattmahoney.net/dc/text.html).
Dataset Splits | Yes | The De-En dataset... It contains 153K sentences for training, 7K sentences for validation, and 7K sentences for testing. The En-Vi dataset contains 133K training sentence pairs... We use TED tst2012 (1,553 sentences) as the validation set and TED tst2013 (1,268 sentences) as the test set.
Hardware Specification | No | The paper describes various model architectures (e.g., '12-layer Transformer-XL model', '3-layer convolutional neural network') and computational settings like batch sizes and learning rates, but it does not specify any particular hardware components such as GPU or CPU models used for the experiments.
Software Dependencies | No | The paper mentions using 'Fairseq' and, implicitly, 'PyTorch' as an underlying framework but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | For En-De dataset, the dropout rate is 0.3. The learning rate is 0.001. The training batch size is 4,096 tokens. We use optimizer Adam with β1 = 0.9 and β2 = 0.98. The number of warmup steps is 4K. The De-En dataset... The initialization learning rate is 1e-07 and the learning rate is 0.0015. The training batch size is 4,096 tokens. (See the optimizer sketch below.)
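
To make the quoted LayerNorm-simple claim concrete, here is a minimal PyTorch sketch of layer normalization with the learnable bias and gain removed. It is an illustration based on the standard LayerNorm formulation, not the authors' released code (see https://github.com/lancopku/AdaNorm for that); the class name `LayerNormSimple` and the epsilon value are choices made for this sketch.

```python
import torch
import torch.nn as nn

class LayerNormSimple(nn.Module):
    """Layer normalization without the learnable bias and gain.

    A minimal sketch of the "LayerNorm-simple" variant described in the
    quoted abstract; not the authors' official implementation.
    """

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize over the last (feature) dimension only; no affine transform.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return (x - mean) / torch.sqrt(var + self.eps)

# For comparison, standard LayerNorm keeps the learnable gain and bias:
#   nn.LayerNorm(d_model, elementwise_affine=True)
# The "simple" variant above drops both.
```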
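The quoted En-De hyperparameters read as a standard Transformer training configuration. The sketch below shows one way to wire them up in plain PyTorch. The stand-in `model` module and the inverse-square-root warmup schedule are assumptions made here (the quote specifies only the warmup step count), and this is not the authors' actual training script, which is built on Fairseq.

```python
import torch

# Stand-in module for illustration only, not the Transformer used in the paper.
model = torch.nn.Linear(512, 512)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,            # "The learning rate is 0.001."
    betas=(0.9, 0.98),  # "optimizer Adam with β1 = 0.9 and β2 = 0.98"
)

warmup_steps = 4_000    # "The number of warmup steps is 4K."
dropout = 0.3           # "the dropout rate is 0.3" (applied inside the real model)
max_tokens = 4_096      # "The training batch size is 4,096 tokens."

# Assumed inverse-square-root schedule around the 4K-step warmup; the quoted
# setup gives only the warmup length, so the decay shape is a guess based on
# the usual Transformer recipe.
def lr_scale(step: int) -> float:
    step = max(step, 1)
    return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
```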