Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Limitations of Normalization in Attention

Authors: Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings.
Researcher Affiliation Academia Timur Mudarisov (University of Luxembourg, Luxembourg); Mikhail Burtsev (London Institute for Mathematical Sciences); Tatiana Petrova (University of Luxembourg, Luxembourg); Radu State (University of Luxembourg, Luxembourg)
Pseudocode Yes Algorithm 1: Distance Analysis. Different L and fixed N. Algorithm 2: Distance Analysis. Different top-N and fixed L. Algorithm 3: Distance Analysis. Critical Top-N detection. Algorithm 4: Geometrical analysis. Separation Ratio and Bounds for Top-N Attention Tokens. Algorithm 5: Gradient Sensitivity Analysis.
Open Source Code No After the blind review process, we will share the corresponding scripts.
Open Datasets Yes We evaluate our theoretical findings on the publicly available GPT-2 model family [15]. All text is tokenised with byte-pair encoding (BPE) [6] as implemented in the Hugging Face transformers library. Unless otherwise stated, the input consists of consecutive chapters from War and Peace by Leo Tolstoy (public domain), providing long-form prose well beyond the model's context window.
Dataset Splits Yes Two complementary experiments are performed: 1. Scaling with sequence length. Fix $N = 5$ and vary $L \in \{32, \ldots, 1024\}$. 2. Scaling with top-N. Fix $L = 1024$ and vary $N \in \{1, 5, 10, 20, 100\}$. For each configuration we compute, across all 144 GPT-2 heads/layers, (i) the true distance $d$ (Eq. 7), (ii) the expectation term of Theorem 1, and (iii) the analytic upper bound. In addition, we estimate a critical top-N value: the smallest $N$ for which the empirical and expected distance distributions are indistinguishable under a two-sample Kolmogorov-Smirnov test ($\alpha = 0.01$).
Hardware Specification Yes For the given research, we used the Apple M1 Pro chip with a 10-core CPU and 16GB of unified memory, based on ARM architecture.
Software Dependencies No The models were implemented and examined using PyTorch [14], running on the Apple M1 Pro's ARM-based CPU architecture to ensure efficient computation. For the parallelization procedure, we used the Joblib library [9].
Experiment Setup Yes Unless otherwise stated, the input consists of consecutive chapters from War and Peace by Leo Tolstoy (public domain), providing long-form prose well beyond the model's context window. For every layer and attention head, we extract the full attention matrix $A \in \mathbb{R}^{L \times L}$ and the associated query, key, and value tensors, enabling direct comparison with our distance and geometry metrics. Implementation details, hyperparameters, and reproducibility scripts are included in Appendix B. For each sequence we set $r = \min_{i \notin I_N} \lVert s - \alpha_i x_i \rVert_2$, so that every non-selected token lies outside the ball $B_r(s)$. Maximum finite-difference Jacobian norm $g(T, \varepsilon)$ for three perturbation magnitudes (coloured curves, log-log scale). For each head-layer pair we evaluate the finite-difference Jacobian norm $g(T, \varepsilon) = \frac{1}{\varepsilon} \lVert \alpha(\ell + \varepsilon\,\Delta\ell) - \alpha(\ell) \rVert_2$, $\lVert \Delta\ell \rVert_2 = 1$, which approximates $\lVert \nabla_{\ell}\alpha \rVert_2$. Full implementation details are provided in Appendix B. Figure 4 shows the maximum value of $g(T, \varepsilon)$ across all 144 heads/layers of GPT-2 for $\varepsilon \in \{10^{-3}, 10^{-1}, 10\}$.
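
The finite-difference Jacobian norm g(T, ε) quoted in the Experiment Setup row can be illustrated with a small standard-library sketch. It is not the authors' script (their code was not released at the time of this summary). The function names `softmax` and `fd_jacobian_norm`, the example logits, the perturbation direction, and the temperature and ε grids are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax; the max is subtracted for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fd_jacobian_norm(logits, temperature, eps, direction):
    """Finite-difference estimate of how strongly the softmax output moves
    when the logits are shifted by eps along a unit-norm direction.

    The direction is normalised here, so callers may pass any non-zero vector.
    """
    length = math.sqrt(sum(d * d for d in direction))
    unit = [d / length for d in direction]
    base = softmax(logits, temperature)
    moved = softmax([x + eps * u for x, u in zip(logits, unit)], temperature)
    return math.sqrt(sum((m - b) ** 2 for m, b in zip(moved, base))) / eps


if __name__ == "__main__":
    # Hypothetical logits and direction, chosen only to show the trend.
    logits = [2.0, 1.0, 0.5, -1.0]
    direction = [1.0, -1.0, 0.0, 0.0]
    for temperature in (0.1, 0.5, 1.0, 2.0):
        for eps in (1e-3, 1e-1):
            g = fd_jacobian_norm(logits, temperature, eps, direction)
            print(f"T={temperature:<4} eps={eps:<6} g={g:.4f}")
```

For small ε the estimate approaches the norm of the softmax Jacobian applied to the chosen direction, which scales like 1/T. That scaling is one way to see why low temperatures are reported as the difficult regime.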
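
The critical top-N criterion in the Dataset Splits row (the smallest N at which the empirical and expected distance distributions pass a two-sample Kolmogorov-Smirnov test at α = 0.01) could be sketched as follows. This is a hedged illustration only. The p-value uses the standard asymptotic Kolmogorov series, which may differ from whatever routine the authors used. The names `ks_statistic`, `ks_pvalue`, and `critical_top_n`, and the dictionary layout mapping each N to a list of per-head distances, are hypothetical.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value from the Kolmogorov series (an approximation)."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        # The series converges slowly here and the p-value is effectively 1.
        return 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


def critical_top_n(empirical_by_n, expected_by_n, alpha=0.01):
    """Return the smallest N whose two samples are not rejected at level alpha,
    or None if every N is rejected."""
    for n_top in sorted(empirical_by_n):
        emp = empirical_by_n[n_top]
        ref = expected_by_n[n_top]
        d = ks_statistic(emp, ref)
        if ks_pvalue(d, len(emp), len(ref)) >= alpha:
            return n_top
    return None


if __name__ == "__main__":
    # Synthetic stand-ins for 144 per-head distances at each top-N value.
    heads = [h / 144 for h in range(144)]
    empirical = {1: heads, 5: heads, 10: heads}
    expected = {1: [v + 1.0 for v in heads], 5: heads, 10: heads}
    print(critical_top_n(empirical, expected))
```

In the synthetic example the N = 1 samples do not overlap, so the test rejects them, while the N = 5 samples are identical, so the search stops there.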