Root Mean Square Layer Normalization

Authors: Biao Zhang, Rico Sennrich

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on several tasks using diverse network architectures show that RMSNorm achieves comparable performance against LayerNorm but reduces the running time by 7%–64% on different models.
Researcher Affiliation | Academia | Biao Zhang (School of Informatics, University of Edinburgh); Rico Sennrich (Institute of Computational Linguistics, University of Zurich; School of Informatics, University of Edinburgh). B.Zhang@ed.ac.uk, sennrich@cl.uzh.ch
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. (A minimal sketch of the RMSNorm computation is given after this table.)
Open Source Code | Yes | Source code is available at https://github.com/bzhangGo/rmsnorm.
Open Datasets | Yes | We train two different models, a GRU-based RNNSearch [4] and a self-attention based neural Transformer [31], on the WMT14 English-German translation task. We train an order-embedding model (OE) proposed by Vendrov et al. [32] on the Microsoft COCO dataset [17] using their public source code in Theano. CIFAR-10 is a supervised image classification task with 10 different classes.
Dataset Splits | Yes | We train two different models... on the WMT14 English-German translation task. We use the newstest2013 dataset. We train an order-embedding model... on the Microsoft COCO dataset [17]. We train a modified version of the ConvPool-CNN-C architecture [15], and follow the same experimental protocol as Salimans and Kingma [22].
Hardware Specification | Yes | Unless otherwise noted, all speed-related statistics are measured on one TITAN X (Pascal). Time: the time in seconds per 1k training steps, measured on a Tesla V100. Time is measured with a GeForce RTX 2080 Ti.
Software Dependencies | No | The paper mentions using TensorFlow, PyTorch, and Theano but does not specify their version numbers.
Experiment Setup | No | The paper references external papers for experimental protocols (e.g., "employ the base setting as in [31]", "follow the same experimental protocol as Salimans and Kingma [22]") but does not explicitly list concrete hyperparameter values or detailed training configurations in its own main text.
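
Since the paper itself contains no pseudocode or algorithm block, the following is a minimal sketch of the RMSNorm computation in PyTorch, included here only for orientation. It is not the authors' reference implementation (that lives at https://github.com/bzhangGo/rmsnorm); the class name, parameter names, default epsilon, and the placement of epsilon inside the square root are assumptions made for illustration.

```python
# Minimal illustrative sketch of RMSNorm (not the authors' code).
# Assumption: eps is added under the square root for numerical stability.
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        # Learnable per-feature gain g. Unlike LayerNorm, RMSNorm does not
        # mean-center the activations and, in its basic form, uses no bias.
        self.gain = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS(x) = sqrt(mean_i x_i^2), computed over the feature dimension.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.gain
```

For example, `RMSNorm(512)(torch.randn(2, 10, 512))` rescales each 512-dimensional feature vector by its root mean square and applies the learned gain; only the re-scaling (not the re-centering) invariance of LayerNorm is retained, which is where the reported speed advantage comes from.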