Root Mean Square Layer Normalization
Authors: Biao Zhang, Rico Sennrich
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several tasks using diverse network architectures show that RMSNorm achieves comparable performance against LayerNorm but reduces the running time by 7%~64% on different models. |
| Researcher Affiliation | Academia | Biao Zhang (School of Informatics, University of Edinburgh); Rico Sennrich (Institute of Computational Linguistics, University of Zurich; School of Informatics, University of Edinburgh). B.Zhang@ed.ac.uk, sennrich@cl.uzh.ch |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks; a hedged sketch of the RMSNorm computation is given after this table. |
| Open Source Code | Yes | Source code is available at https://github.com/bzhangGo/rmsnorm. |
| Open Datasets | Yes | We train two different models, a GRU-based RNNSearch [4] and a self-attention-based neural Transformer [31], on the WMT14 English-German translation task. We train an order-embedding model (OE) proposed by Vendrov et al. [32] on the Microsoft COCO dataset [17] using their public source code in Theano. CIFAR-10 is a supervised image classification task with 10 different classes. |
| Dataset Splits | Yes | We train two different models... on the WMT14 English-German translation task. We use the newstest2013 dataset. We train an order-embedding model... on the Microsoft COCO dataset [17]. We train a modified version of the ConvPool-CNN-C architecture [15], and follow the same experimental protocol as Salimans and Kingma [22]. |
| Hardware Specification | Yes | Unless otherwise noted, all speed-related statistics are measured on one TITAN X (Pascal). Time: the time in seconds per 1k training steps, measured on a Tesla V100. Time is measured with a GeForce RTX 2080 Ti. |
| Software Dependencies | No | The paper mentions using TensorFlow, PyTorch, and Theano but does not specify their version numbers. |
| Experiment Setup | No | The paper references external papers for experimental protocols (e.g., 'employ the base setting as in [31]', 'follow the same experimental protocol as Salimans and Kingma [22]') but does not explicitly list concrete hyperparameter values or detailed training configurations within its own main text. |
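
Since the paper ships no structured pseudocode, the following is a minimal sketch of the RMSNorm computation it describes: divide each input vector by its root mean square, RMS(a) = sqrt((1/n) * sum_i a_i^2), and multiply by a learned per-feature gain. This is an illustrative PyTorch sketch, not the authors' reference implementation (which is in the linked repository); the class name `RMSNorm` and the `eps` stability term are assumptions for this example.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Sketch of RMSNorm: rescale inputs by their root mean square and
    apply a learned per-feature gain. Unlike LayerNorm, no mean is
    subtracted and no bias is added."""

    def __init__(self, d: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps                           # illustrative stability term (assumption)
        self.gain = nn.Parameter(torch.ones(d))  # learned gain g, initialized to 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS(a) = sqrt(mean(a_i^2)), computed over the feature dimension
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.gain

# Usage: normalize a batch of 3 vectors with 8 features each.
x = torch.randn(3, 8)
print(RMSNorm(8)(x).shape)  # torch.Size([3, 8])
```

Dropping LayerNorm's mean-centering and bias, so that only a single second-order statistic is computed per vector, is what the paper credits for the reported 7%~64% running-time reduction.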