Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs
Authors: Mana Sakai, Ryo Karakida, Masaaki Imaizumi
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments validate our theoretical predictions, confirming the effectiveness of our theory at finite width and accurate description of finite-head attentions. [...] We perform simulations to validate the infinite-width limit distributions derived in Example 3.1. |
| Researcher Affiliation | Academia | Mana Sakai (1,3), Ryo Karakida (2,3), Masaaki Imaizumi (1,3). (1) The University of Tokyo; (2) National Institute of Advanced Industrial Science and Technology; (3) RIKEN Center for Advanced Intelligence Project |
| Pseudocode | Yes | Algorithm 1 Multi-Head Attention (Example 3.1). Input: $\{x^i\}_{i \in [s]} \subset \mathbb{R}^n$, input vectors for a sequence of length $s$. Input: $\{W^{Q,a}, W^{K,a}, W^{V,a}, W^{O,a}\}_{a \in [H]} \subset \mathbb{R}^{n \times n}$, weight matrices for $H$ heads. |
| Open Source Code | Yes | All simulation codes are available at https://github.com/manasakai/infinite-width-attention. |
| Open Datasets | No | The paper describes generating data for its simulations rather than using pre-existing open datasets. For instance, in Appendix B.1, it states: "Each element of the initial vector $\in \mathbb{R}^n$ is sampled independently from a standard normal distribution." No specific external dataset or repository link is provided. |
| Dataset Splits | No | The paper performs simulations by generating samples rather than using pre-defined dataset splits. In Appendix B.1, it states: "To estimate the empirical distributions of finite-width attention outputs and their corresponding infinite-width limits, we employ Monte Carlo sampling. For each such estimation, 50,000 samples are drawn, unless otherwise noted." It does not describe training, validation, or test splits for a specific dataset. |
| Hardware Specification | No | Our experiments are small-scale and implementable by a small laptop. Also, we do not pursue the computational cost in this study, so the computational resource is out of our focus. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as Python versions, specific libraries (e.g., PyTorch, TensorFlow), or their corresponding versions. |
| Experiment Setup | Yes | Unless otherwise noted, the simulations presented in this paper set the spatial dimension to $s = 4$. The core experimental setup follows that described in Example 3.1, which is outlined in Algorithm 1. [...] Each element of the initial vector $\in \mathbb{R}^n$ is sampled independently from a standard normal distribution. For all weight matrices involved in the attention mechanism $W^{Q,a}, W^{K,a}, W^{V,a}, W^{O,a}$ and the matrices $W^i$ generating $x^i$, we set $\sigma^2_{W^{Q,a}} = \sigma^2_{W^{K,a}} = \sigma^2_{W^{V,a}} = \sigma^2_{W^{O,a}} = \sigma^2_{W^i} = 1$. [...] In our experiments, we set $C = 100$. [...] For each such estimation, 50,000 samples are drawn, unless otherwise noted. Kernel density estimation (KDE) is used to visualize these empirical distributions. |
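
The Experiment Setup row describes a Monte Carlo procedure: draw Gaussian weights, compute a finite-width attention output, repeat many times, and visualize the resulting distribution with KDE. The following is a minimal sketch of that kind of simulation, not the authors' code (which is in the repository linked above). Several details are assumptions made for illustration and are not stated in the table: the inputs are generated as a Gaussian matrix applied to the initial vector, weight entries have variance 1/n, attention scores use softmax with 1/sqrt(n) scaling, heads are combined with a 1/sqrt(H) factor, and the KDE bandwidth uses a common rule of thumb. All function names are hypothetical, the constant C from the table is not modeled, and the sample count is kept small so that pure Python stays fast.

```python
import math
import random
import statistics


def gaussian_matrix(rows, cols, sigma, rng):
    """Matrix with i.i.d. N(0, sigma^2 / cols) entries (assumed initialization)."""
    scale = sigma / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_output_coordinate(n, s, num_heads, rng, sigma=1.0):
    """One draw of coordinate 0 of the attention output at sequence position 0."""
    v0 = [rng.gauss(0.0, 1.0) for _ in range(n)]  # initial vector, N(0, 1) entries
    # Assumption: each input is a fresh Gaussian matrix applied to the initial vector.
    xs = [matvec(gaussian_matrix(n, n, sigma, rng), v0) for _ in range(s)]
    out = 0.0
    for _ in range(num_heads):
        w_q = gaussian_matrix(n, n, sigma, rng)
        w_k = gaussian_matrix(n, n, sigma, rng)
        w_v = gaussian_matrix(n, n, sigma, rng)
        w_o_row = gaussian_matrix(1, n, sigma, rng)[0]  # only row 0 of W^O is needed
        q0 = matvec(w_q, xs[0])
        keys = [matvec(w_k, x) for x in xs]
        values = [matvec(w_v, x) for x in xs]
        # Assumption: softmax attention with 1/sqrt(n) scaling of the scores.
        weights = softmax([dot(q0, k) / math.sqrt(n) for k in keys])
        mixed = [sum(w * val[c] for w, val in zip(weights, values)) for c in range(n)]
        out += dot(w_o_row, mixed)
    return out / math.sqrt(num_heads)  # assumed 1/sqrt(H) normalization over heads


def monte_carlo(num_samples, n=16, s=4, num_heads=2, seed=0):
    """Independent samples of one output coordinate, each with fresh weights."""
    rng = random.Random(seed)
    return [sample_output_coordinate(n, s, num_heads, rng) for _ in range(num_samples)]


def gaussian_kde(samples, grid, bandwidth=None):
    """Gaussian kernel density estimate evaluated at each point of `grid`."""
    m = len(samples)
    if bandwidth is None:
        # Rule-of-thumb bandwidth (an illustrative choice, not from the paper).
        bandwidth = 1.06 * statistics.stdev(samples) * m ** (-0.2)
    norm = 1.0 / (m * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in samples)
        for g in grid
    ]


if __name__ == "__main__":
    draws = monte_carlo(200)
    grid = [-3.0 + 0.5 * k for k in range(13)]
    for point, density in zip(grid, gaussian_kde(draws, grid)):
        print(f"{point:+.1f}  {density:.4f}")
```

Raising `num_samples` toward the 50,000 draws mentioned in the table, and increasing `n`, would be the natural next step; a vectorized array library would make that practical.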