On the Nonlinearity of Layer Normalization

Authors: Yunhao Ni, Yuxin Guo, Junlong Jia, Lei Huang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments are conducted on CIFAR-10 and MNIST with random labels assigned (CIFAR-10-RL and MNIST-RL). We evaluate the classification accuracy on the training set after the model is trained, which indicates the capacity of the models to fit the dataset empirically. We only provide essential components of the experimental setup; for more details, please refer to Appendix F.1.
Researcher Affiliation | Academia | SKLCCSE, Institute of Artificial Intelligence, Beihang University, Beijing, China.
Pseudocode | Yes | Algorithm 1: Projection Merge Algorithm
Open Source Code | No | No explicit statement or link to open-source code for the methodology described in this paper was found.
Open Datasets | Yes | The experiments are conducted on CIFAR-10 and MNIST with random labels assigned (CIFAR-10-RL and MNIST-RL).
Dataset Splits | No | The experiments are conducted on CIFAR-10 and MNIST with random labels assigned (CIFAR-10-RL and MNIST-RL). We evaluate the classification accuracy on the training set after the model is trained, which indicates the capacity of the models to fit the dataset empirically.
Hardware Specification | No | We only provide essential components of the experimental setup; for more details, please refer to Appendix F.1.
Software Dependencies | No | We conduct experiments to apply LN-G on Transformer (Vaswani et al., 2017) (where LN is the default normalization) for machine translation tasks using fairseq-py (Ott et al., 2019).
Experiment Setup | Yes | For the training of the linear classifier, we apply both the SGD optimizer with momentum (0.1) and the Adam optimizer with betas (0.9, 0.999). We train the model for 150 epochs and use a learning rate schedule with a decay of 0.5 every 20 epochs. We search over batch sizes in {128, 256}, initial learning rates in {0.001, 0.003, 0.005, 0.008, 0.05, 0.08, 0.1, 0.15}, and 5 random seeds, and report the best accuracy among these hyper-parameter configurations.
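
The Research Type and Open Datasets rows above describe the random-label variants of the datasets (CIFAR-10-RL and MNIST-RL), where training accuracy is used as an empirical measure of fitting capacity. Below is a minimal sketch of how such variants could be constructed, assuming PyTorch and torchvision; the wrapper class name RandomLabelDataset is illustrative and not taken from the paper.

```python
# Illustrative construction of random-label datasets (CIFAR-10-RL, MNIST-RL).
# Assumes PyTorch + torchvision; not the authors' released code.
import torch
from torch.utils.data import Dataset
from torchvision import datasets, transforms


class RandomLabelDataset(Dataset):
    """Wraps a dataset and replaces every label with a fixed random one."""

    def __init__(self, base: Dataset, num_classes: int, seed: int = 0):
        self.base = base
        gen = torch.Generator().manual_seed(seed)
        # Labels are resampled once and then kept fixed, so the model must
        # memorize them; training accuracy then reflects fitting capacity.
        self.labels = torch.randint(num_classes, (len(base),), generator=gen)

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, _ = self.base[idx]
        return x, int(self.labels[idx])


transform = transforms.ToTensor()
cifar10_rl = RandomLabelDataset(
    datasets.CIFAR10("data", train=True, download=True, transform=transform),
    num_classes=10,
)
mnist_rl = RandomLabelDataset(
    datasets.MNIST("data", train=True, download=True, transform=transform),
    num_classes=10,
)
```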
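The Experiment Setup row quotes a concrete hyper-parameter search: SGD with momentum 0.1 or Adam with betas (0.9, 0.999), 150 epochs, the learning rate halved every 20 epochs, batch sizes {128, 256}, the listed initial learning rates, and 5 random seeds, with the best training accuracy reported. The sketch below shows one way to organize that sweep, assuming PyTorch; build_model is a placeholder for the paper's linear-classifier model, which is not specified in the excerpt.

```python
# Hedged sketch of the hyper-parameter sweep described in the Experiment Setup
# row. Assumes PyTorch; build_model() is a user-supplied placeholder.
import itertools
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

BATCH_SIZES = [128, 256]
LEARNING_RATES = [0.001, 0.003, 0.005, 0.008, 0.05, 0.08, 0.1, 0.15]
SEEDS = range(5)
OPTIMIZERS = ["sgd", "adam"]


def train_accuracy(model, loader):
    # Classification accuracy on the training set, used as a capacity measure.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total


def run_one(dataset, build_model, batch_size, lr, seed, opt_name, epochs=150):
    torch.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    model = build_model()
    if opt_name == "sgd":
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.1)
    else:
        optimizer = optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    # Decay the learning rate by 0.5 every 20 epochs, as described in the setup.
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return train_accuracy(model, loader)


def best_training_accuracy(dataset, build_model):
    # Grid search over all configurations; report the best training accuracy.
    configs = itertools.product(OPTIMIZERS, BATCH_SIZES, LEARNING_RATES, SEEDS)
    return max(run_one(dataset, build_model, b, lr, s, o) for o, b, lr, s in configs)
```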