Understanding and Improving Layer Normalization
Authors: Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, Junyang Lin
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that a simple version of Layer Norm (Layer Norm-simple) without the bias and gain outperforms Layer Norm on four datasets. It obtains the state-of-the-art performance on En-Vi machine translation. To investigate how Layer Norm works, we conduct a series of experiments in this paper. |
| Researcher Affiliation | Academia | Jingjing Xu1, Xu Sun1,2 , Zhiyuan Zhang1, Guangxiang Zhao2, Junyang Lin1 1 MOE Key Lab of Computational Linguistics, School of EECS, Peking University 2 Center for Data Science, Peking University {jingjingxu,xusun,zzy1210,zhaoguangxiang,linjunyang}@pku.edu.cn |
| Pseudocode | No | The paper describes algorithms through mathematical equations (e.g., Eq. 1, 2, 4, 6, 9) but does not provide any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our code is released at https://github.com/lancopku/AdaNorm |
| Open Datasets | Yes | Machine translation includes three widely-used datasets, WMT English-German (En-De), IWSLT 14 German-English (De-En) [Cettolo et al., 2014] and IWSLT 15 English-Vietnamese (En-Vi) [Cettolo et al., 2015]. Language modeling includes a large dataset, Enwiki8, that contains 100M bytes of unprocessed Wikipedia text. http://mattmahoney.net/dc/text.html |
| Dataset Splits | Yes | The De-En dataset... It contains 153K sentences for training, 7K sentences for validation, and 7K sentences for testing. The En-Vi dataset contains 133K training sentence pairs... We use TED tst2012 (1,553 sentences) as the validation set and TED tst2013 (1,268 sentences) as the test set. |
| Hardware Specification | No | The paper describes various model architectures (e.g., '12-layer Transformer-XL model', '3-layer convolutional neural network') and computational settings like batch sizes and learning rates, but it does not specify any particular hardware components such as GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions using 'Fairseq' and implicitly 'PyTorch' as an underlying framework but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For En-De dataset, the dropout rate is 0.3. The learning rate is 0.001. The training batch size is 4,096 tokens. We use optimizer Adam with β1 = 0.9 and β2 = 0.98. The number of warmup steps is 4K. The De-En dataset... The initialization learning rate is 1e-07 and the learning rate is 0.0015. The training batch size is 4,096 tokens. |
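The Research Type row quotes the paper's central finding: LayerNorm-simple drops the learnable bias and gain from standard LayerNorm, so the output is just the input re-centered and re-scaled by its own mean and standard deviation. The following is a minimal PyTorch sketch of that idea; the class name, `eps` term, and normalization over the last dimension are assumptions for illustration, not the authors' released implementation (see the AdaNorm repository linked above).

```python
import torch
import torch.nn as nn


class LayerNormSimple(nn.Module):
    """LayerNorm without the learnable bias and gain (illustrative sketch)."""

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        # eps is added for numerical stability; it is an assumption of this
        # sketch, not a detail taken from the paper.
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize over the last (feature) dimension: subtract the mean and
        # divide by the standard deviation. Standard LayerNorm would further
        # apply a learnable gain and bias here; LayerNorm-simple does not.
        mu = x.mean(dim=-1, keepdim=True)
        sigma = x.std(dim=-1, keepdim=True, unbiased=False)
        return (x - mu) / (sigma + self.eps)
```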
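The Experiment Setup row lists the En-De training hyperparameters (learning rate 0.001, Adam with β1 = 0.9 and β2 = 0.98, 4K warmup steps). A hedged sketch of how those values map onto a PyTorch optimizer follows; the linear warmup schedule and the placeholder model are simplifications for illustration and stand in for whatever schedule the authors' Fairseq configuration actually uses. The dropout rate (0.3) and the 4,096-token batch size belong to the model and data pipeline and are not shown here.

```python
import torch

# Placeholder module standing in for the Transformer; illustrative only.
model = torch.nn.Linear(512, 512)

# Adam with the betas and peak learning rate reported for the En-De setup.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.98))

warmup_steps = 4_000  # "The number of warmup steps is 4K."


def lr_scale(step: int) -> float:
    # Linear warmup to the peak learning rate over the first 4K steps
    # (a simplified stand-in for the schedule used in the paper).
    return min(1.0, step / warmup_steps)


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
```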