Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Just One Layer Norm Guarantees Stable Extrapolation
Authors: Juliusz Ziomek, George Whittle, Michael A Osborne
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we prove general results, the first of their kind, by applying Neural Tangent Kernel (NTK) theory to analyse infinitely-wide neural networks trained until convergence and prove that the inclusion of just one Layer Norm (LN) fundamentally alters the induced NTK, transforming it into a bounded-variance kernel. As a result, the output of an infinitely wide network with at least one LN remains bounded, even on inputs far from the training data. In contrast, we show that a broad class of networks without LN can produce pathologically large outputs for certain inputs. We support these theoretical findings with empirical experiments on finite-width networks, demonstrating that while standard NNs often exhibit uncontrolled growth outside the training domain, a single LN layer effectively mitigates this instability. |
| Researcher Affiliation | Academia | Juliusz Ziomek, George Whittle, Michael A. Osborne. Machine Learning Research Group, University of Oxford. (Footnote markers: Equal Contribution; Corresponding Author.) {juliusz, george, mosb}@robots.ox.ac.uk |
| Pseudocode | No | The paper primarily focuses on theoretical proofs, mathematical derivations (such as defining Layer Normalization, soft-cosine similarity, and various theorems related to NTK), and descriptions of experimental setups. It does not include any explicitly labeled pseudocode blocks or algorithms in a structured format. |
| Open Source Code | Yes | We open-source our codebase (footnote 2). Footnote 2: https://github.com/JuliuszZiomek/LN-NTK |
| Open Datasets | Yes | As the first real-world problem we study the UCI Physicochemical Properties of Protein Tertiary Structure dataset [28]... We utilise the UTKFace dataset (footnote 3), a large-scale facial image dataset... [28] Prashant Rana. Physicochemical Properties of Protein Tertiary Structure. UCI Machine Learning Repository, 2013. DOI: https://doi.org/10.24432/C5QW3H. Footnote 3: https://susanqq.github.io/UTKFace/ |
| Dataset Splits | Yes | As training set, we use randomly selected 90% of all proteins with surface area less than 20 thousand square angstroms. We then construct two validation sets, one in-domain with the remaining 10% of proteins with smaller surface area, and one out-of-domain with all proteins whose surface area is larger than 20 thousand square angstroms. |
| Hardware Specification | Yes | To run all experiments we used NVIDIA GeForce RTX 3090 with 24GB of memory. |
| Software Dependencies | No | We implemented all networks using PyTorch [27]. We utilised Adam optimiser [21] and MSE Loss for all experiments. For XGBoost we use the XGBoost Python package (footnote 5) with hyperparameter values set to default values and number of boosting rounds set to 100. While PyTorch and XGBoost are mentioned, specific version numbers for these software packages are not provided. |
| Experiment Setup | Yes | For details about compute and the exact hyperparameter settings, see Appendix J. ... All hidden layers have a size of 128. To be consistent with the theory, we initialised weights of fully-connected layers with Kaiming initialisation [16] and biases were sampled from N(0, σ_b²) with σ_b² = 0.01. For the UTK experiments in Section 4.3, a frozen ResNet-18 [17] is used before the fully-connected layers. We utilised Adam optimiser [21] and MSE Loss for all experiments. See Table 3 below for exact hyperparameters settings. Table 3: Hyperparameter values used throughout the experiments (experiment, method: batch size, epochs, learning rate). 4.1, All: 100 (entire dataset), 3000, 0.001; 4.2, All: 40132 (entire dataset), 2500, 0.001; 4.3, Standard NN: 128, 10, 0.001; 4.3, LN after 1st: 128, 10, 0.001; 4.3, LN after 2nd: 128, 10, 0.003; 4.3, LN after every: 128, 10, 0.003. |
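
The split quoted in the Dataset Splits row can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' code: the function name `split_by_surface_area`, the `(surface_area, target)` record layout, the seed, and the synthetic records are all hypothetical. Only the 90%/10% in-domain split and the threshold of 20 thousand square angstroms come from the quoted text.

```python
import random


def split_by_surface_area(records, threshold=20_000.0, train_fraction=0.9, seed=0):
    """Split records into train, in-domain validation, and out-of-domain validation.

    `records` is a list of (surface_area, target) tuples; this layout is an
    assumption made for illustration only.
    """
    # The quoted text says "less than" and "larger than" the threshold, so a
    # record exactly at the threshold falls into neither group here.
    in_domain = [r for r in records if r[0] < threshold]
    out_of_domain = [r for r in records if r[0] > threshold]

    # Shuffle a copy of the in-domain records with a fixed (hypothetical) seed,
    # then take the first 90% for training and the rest for validation.
    shuffled = list(in_domain)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:], out_of_domain


# Synthetic stand-in data: surface areas from 5,000 to 34,900 in steps of 100.
records = [(float(area), 0.0) for area in range(5_000, 35_000, 100)]
train, val_in, val_ood = split_by_surface_area(records)
print(len(train), len(val_in), len(val_ood))
```

Keeping the out-of-domain set entirely above the threshold is what lets the two validation scores be compared: one measures interpolation within the training range, the other extrapolation beyond it.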