On the Nonlinearity of Layer Normalization
Authors: Yunhao Ni, Yuxin Guo, Junlong Jia, Lei Huang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments are conducted on CIFAR-10 and MNIST with random labels assigned (CIFAR-10-RL and MNIST-RL). We evaluate the classification accuracy on the training set after the model is trained, which empirically indicates the capacity of the models to fit the dataset. We only provide the essential components of the experimental setup; for more details, please refer to Appendix F.1. |
| Researcher Affiliation | Academia | SKLCCSE, Institute of Artificial Intelligence, Beihang University, Beijing, China. |
| Pseudocode | Yes | Algorithm 1 Projection Merge Algorithm |
| Open Source Code | No | No explicit statement or link to open-source code for the methodology described in this paper was found. |
| Open Datasets | Yes | The experiments are conducted on CIFAR-10 and MNIST with random labels assigned (CIFAR-10-RL and MNIST-RL). |
| Dataset Splits | No | The experiments are conducted on CIFAR-10 and MNIST with random labels assigned (CIFAR-10-RL and MNIST-RL). We evaluate the classification accuracy on the training set after the model is trained, which empirically indicates the capacity of the models to fit the dataset. |
| Hardware Specification | No | We only provide the essential components of the experimental setup; for more details, please refer to Appendix F.1. |
| Software Dependencies | No | We conduct experiments to apply LN-G on Transformer (Vaswani et al., 2017) (where LN is the default normalization) for machine translation tasks using fairseq-py (Ott et al., 2019). |
| Experiment Setup | Yes | For training the linear classifier, we apply both the SGD optimizer with momentum (0.1) and the Adam optimizer with betas (0.9, 0.999). We train the model for 150 epochs and use a learning rate schedule with a decay of 0.5 every 20 epochs. We search over batch sizes in {128, 256}, initial learning rates in {0.001, 0.003, 0.005, 0.008, 0.05, 0.08, 0.1, 0.15}, and 5 random seeds, and report the best accuracy across these hyper-parameter configurations. (A hedged sketch of this setup follows the table.) |
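
To make the reported setup concrete, the sketch below shows one way the described hyper-parameter search could be wired up in PyTorch. It is a minimal sketch under stated assumptions, not the authors' code: the random-label dataset construction (`make_cifar10_rl`), the training helper (`train_one_config`), and the choice of a plain linear classifier over flattened pixels are assumptions; only the optimizer settings (SGD with momentum 0.1, Adam with betas (0.9, 0.999)), the 150 epochs, the 0.5 decay every 20 epochs, and the batch-size / learning-rate / seed grid come from the excerpt above.

```python
# Hedged sketch of the reported search grid (PyTorch). Helper names and the
# linear-classifier definition are assumptions; the grid values follow the excerpt above.
import itertools
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_cifar10_rl(root="./data", seed=0):
    """CIFAR-10 with labels replaced by uniformly random classes (assumed CIFAR-10-RL construction)."""
    ds = datasets.CIFAR10(root, train=True, download=True, transform=transforms.ToTensor())
    g = torch.Generator().manual_seed(seed)
    ds.targets = torch.randint(0, 10, (len(ds.targets),), generator=g).tolist()
    return ds

def train_one_config(model, loader, optimizer, epochs=150):
    """Train for 150 epochs with a step decay of 0.5 every 20 epochs; return final training accuracy."""
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x.flatten(1)), y).backward()
            optimizer.step()
        scheduler.step()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x.flatten(1)).argmax(1) == y).sum().item()
            total += y.numel()
    return correct / total

best = 0.0
for batch_size, lr, seed in itertools.product(
        [128, 256],
        [0.001, 0.003, 0.005, 0.008, 0.05, 0.08, 0.1, 0.15],
        range(5)):
    torch.manual_seed(seed)
    loader = DataLoader(make_cifar10_rl(seed=seed), batch_size=batch_size, shuffle=True)
    for opt_name in ("sgd", "adam"):
        model = nn.Linear(3 * 32 * 32, 10)  # linear classifier over flattened pixels (assumed architecture)
        if opt_name == "sgd":
            optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.1)
        else:
            optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
        best = max(best, train_one_config(model, loader, optimizer))

print(f"Best training accuracy across configurations: {best:.4f}")
```

Reporting the best training-set accuracy over the full grid mirrors the excerpt's "report the best accuracy from these configurations"; any normalization variant from the paper (e.g. LN or LN-G inserted before the classifier) is omitted here, since the excerpt does not specify how it is placed.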