Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs

Authors: Mana Sakai, Ryo Karakida, Masaaki Imaizumi

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Numerical experiments validate our theoretical predictions, confirming the effectiveness of our theory at finite width and accurate description of finite-head attentions. [...] We perform simulations to validate the infinite-width limit distributions derived in Example 3.1. |
| Researcher Affiliation | Academia | Mana Sakai (1,3), Ryo Karakida (2,3), Masaaki Imaizumi (1,3). (1) The University of Tokyo; (2) National Institute of Advanced Industrial Science and Technology; (3) RIKEN Center for Advanced Intelligence Project |
| Pseudocode | Yes | Algorithm 1 Multi-Head Attention (Example 3.1). Input: $\{x^i\}_{i \in [s]} \subset \mathbb{R}^n$, input vectors for a sequence of length $s$. Input: $\{W^{Q,a}, W^{K,a}, W^{V,a}, W^{O,a}\}_{a \in [H]} \subset \mathbb{R}^{n \times n}$, weight matrices for $H$ heads. |
| Open Source Code | Yes | All simulation codes are available at https://github.com/manasakai/infinite-width-attention. |
| Open Datasets | No | The paper describes generating data for its simulations rather than using pre-existing open datasets. For instance, in Appendix B.1, it states: "Each element of the initial vector $h \in \mathbb{R}^n$ is sampled independently from a standard normal distribution." No specific external dataset or repository link is provided. |
| Dataset Splits | No | The paper performs simulations by generating samples rather than using pre-defined dataset splits. In Appendix B.1, it states: "To estimate the empirical distributions of finite-width attention outputs and their corresponding infinite-width limits, we employ Monte Carlo sampling. For each such estimation, 50,000 samples are drawn, unless otherwise noted." It does not describe training, validation, or test splits for a specific dataset. |
| Hardware Specification | No | Our experiments are small-scale and implementable by a small laptop. Also, we do not pursue the computational cost in this study, so the computational resource is out of our focus. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as Python versions, specific libraries (e.g., PyTorch, TensorFlow), or their corresponding versions. |
| Experiment Setup | Yes | Unless otherwise noted, the simulations presented in this paper set the spatial dimension to $s = 4$. The core experimental setup follows that described in Example 3.1, which is outlined in Algorithm 1. [...] Each element of the initial vector $h \in \mathbb{R}^n$ is sampled independently from a standard normal distribution. For all weight matrices involved in the attention mechanism $W^{Q,a}, W^{K,a}, W^{V,a}, W^{O,a}$ and the matrices $W^i$ generating $x^i$, we set $\sigma^2_{W^{Q,a}} = \sigma^2_{W^{K,a}} = \sigma^2_{W^{V,a}} = \sigma^2_{W^{O,a}} = \sigma^2_{W^i} = 1$. [...] In our experiments, we set $C = 100$. [...] For each such estimation, 50,000 samples are drawn, unless otherwise noted. Kernel density estimation (KDE) is used to visualize these empirical distributions. |
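
The Experiment Setup entry above describes a Monte Carlo simulation of a finite-width multi-head attention layer followed by kernel density estimation. The sketch below illustrates that kind of pipeline in NumPy; it is not the authors' code (see the repository linked under Open Source Code for that). Several details are assumptions, because this summary does not reproduce Algorithm 1 in full: inputs are generated as W_i h, every weight entry is drawn from N(0, sigma^2 / n), attention scores are scaled by 1/sqrt(n), and head outputs are summed and scaled by 1/sqrt(H). The constant C = 100 is not used, since its role is not stated here. The function names `sample_attention_output`, `monte_carlo_coordinate`, and `gaussian_kde_1d` are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def sample_attention_output(n, s=4, H=1, sigma2=1.0, rng=None):
    """Draw one random multi-head attention output of shape (s, n).

    Assumed conventions (not confirmed by the summary): weights are i.i.d.
    N(0, sigma2 / n), scores use 1/sqrt(n) scaling, heads are summed and
    scaled by 1/sqrt(H).
    """
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(sigma2 / n)

    h = rng.standard_normal(n)                    # initial vector, N(0, 1) entries
    W_in = rng.normal(0.0, std, size=(s, n, n))   # one matrix per sequence position
    X = W_in @ h                                  # rows are the s input vectors

    out = np.zeros((s, n))
    for _ in range(H):
        WQ, WK, WV, WO = rng.normal(0.0, std, size=(4, n, n))
        Q, K, V = X @ WQ.T, X @ WK.T, X @ WV.T    # queries, keys, values: (s, n)
        A = softmax(Q @ K.T / np.sqrt(n), axis=-1)  # (s, s) attention weights
        out += (A @ V) @ WO.T                     # per-head output projection
    return out / np.sqrt(H)


def monte_carlo_coordinate(n, num_samples, s=4, H=1, seed=0):
    """Collect independent draws of one output coordinate (position 0, index 0)."""
    rng = np.random.default_rng(seed)
    return np.array(
        [sample_attention_output(n, s, H, rng=rng)[0, 0] for _ in range(num_samples)]
    )


def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Gaussian KDE evaluated on `grid`; defaults to Scott's rule bandwidth."""
    samples = np.asarray(samples, dtype=float)
    if bandwidth is None:
        bandwidth = samples.std(ddof=1) * samples.size ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = samples.size * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


if __name__ == "__main__":
    # The paper reports 50,000 samples per estimate; fewer are used here for speed.
    draws = monte_carlo_coordinate(n=64, num_samples=1000, H=2)
    grid = np.linspace(-4.0, 4.0, 200)
    density = gaussian_kde_1d(draws, grid)
    print(draws.mean(), draws.std(), density.max())
```

Increasing `n` while holding `s` and `H` fixed gives a sequence of empirical densities that could be compared against a limiting distribution. In practice `scipy.stats.gaussian_kde` could replace the hand-written estimator; it is written out here only to keep the sketch dependent on NumPy alone.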