Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Attention Mechanism, Max-Affine Partition, and Universal Approximation

Authors: Hude Liu, Jerry Yao-Chieh Hu, Zhao Song, Han Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We establish the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as an input domain-partition mechanism that assigns distinct values to subregions. This allows us to engineer the attention weights such that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by sum-of-linear transformations, is capable of approximating any continuous function on a compact domain under the L∞-norm. Furthermore, we extend this construction to approximate any Lebesgue integrable function under Lp-norm for 1 ≤ p < ∞. Lastly, we also extend our techniques and show that, for the first time, single-head cross-attention achieves the same universal approximation guarantees. [...] 5 Concluding Remarks: Numerical validations back up our theory in Appendix B. [...] B Proof-of-Concept Experiments: Figure 2: Scale of Attention Weights vs. Training noise. For MNIST, CIFAR-10, and Fashion-MNIST we plot the ℓ2-norm of W_K and W_Q against the injected label-noise ratio. In all three datasets the weight scale declines monotonically as noise increases, corroborating Proposition 3.2: higher noise hampers precise partitioning, so the model reduces the magnitude of weights that form the attention score matrix.
Researcher Affiliation | Academia | Hude Liu, Jerry Yao-Chieh Hu, Zhao Song, Han Liu. Center for Foundation Models and Generative AI & Department of Computer Science, Simons Institute for the Theory of Computing, UC Berkeley, Berkeley, CA 94720, USA; Department of Statistics and Data Science, Northwestern University, Evanston, IL 60208, USA. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes methods and proofs using mathematical notation and conceptual steps (e.g., 'Overview of Proof Strategy' in Section 4.1), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Appendix B specifies datasets, noise-injection protocol, model size, and training procedure; code will be released anonymously with the supplementary material.
Open Datasets | Yes | B Proof-of-Concept Experiments: Data. We perform separate experiments on the training set of the noised MNIST, CIFAR10 and Fashion MNIST datasets with noise level (the coefficient multiplying the standard Gaussian noise) gradually adding from 0 to 0.72 by the step size of 0.03.
Dataset Splits | No | Appendix B mentions 'training set of the noised MNIST, CIFAR10 and Fashion MNIST datasets' but does not specify the train/test/validation splits (e.g., percentages or sample counts) or how the datasets were partitioned beyond being a 'training set'.
Hardware Specification | No | 8. Experiments compute resources. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [No] Justification: The proof-of-concept experiments run on a single commodity GPU, but exact hardware specifications and wall-clock times are not reported.
Software Dependencies | No | 5. Open access to data and code. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: All datasets (MNIST, CIFAR-10, Fashion-MNIST) are public; an anonymized PyTorch implementation and run scripts will be included in the supplemental ZIP.
Experiment Setup | No | Network setups. Our network consists of a single-head self-attention followed by a feed-forward network. Due to the complexity and different characteristics of the selected datasets, the size of the feed-forward network slightly differs between datasets.
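The noise protocol quoted under Open Datasets (a coefficient multiplying standard Gaussian noise, swept from 0 to 0.72 in steps of 0.03) and the weight-scale measurement quoted from Figure 2 can be sketched roughly as follows. This is a minimal illustration in standard-library Python, not the authors' code: the helper names `noise_levels`, `add_gaussian_noise`, and `l2_norm` are hypothetical, the fixed seed and the toy image are assumptions, and reading the "ℓ2-norm" of a weight matrix as the Frobenius norm is also an assumption, since the excerpt does not say which matrix norm is used.

```python
import math
import random


def noise_levels(start=0.0, stop=0.72, step=0.03):
    """Return the grid of noise coefficients from start to stop, inclusive."""
    count = round((stop - start) / step) + 1
    # Round each level to two decimals to avoid floating-point drift.
    return [round(start + i * step, 2) for i in range(count)]


def add_gaussian_noise(pixels, level, rng):
    """Add level * N(0, 1) noise independently to every pixel value."""
    return [x + level * rng.gauss(0.0, 1.0) for x in pixels]


def l2_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows (assumed reading of 'l2-norm')."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


levels = noise_levels()
rng = random.Random(0)  # fixed seed: an assumption, for repeatability only
image = [0.0, 0.5, 1.0, 0.25]  # hypothetical flattened image with values in [0, 1]
noised = add_gaussian_noise(image, levels[-1], rng)
```

In an actual run, one would presumably train the single-head attention model once per entry in `levels` and record `l2_norm` of the trained key and query matrices; the training loop is omitted here because the page does not report the setup in enough detail to reproduce it.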