Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs

Authors: Mana Sakai, Ryo Karakida, Masaaki Imaizumi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Numerical experiments validate our theoretical predictions, confirming the effectiveness of our theory at finite width and accurate description of finite-head attentions. [...] We perform simulations to validate the infinite-width limit distributions derived in Example 3.1.
Researcher Affiliation	Academia	Mana Sakai1,3 Ryo Karakida2,3 Masaaki Imaizumi1,3 1The University of Tokyo 2National Institute of Advanced Industrial Science and Technology 3RIKEN Center for Advanced Intelligence Project
Pseudocode	Yes	Algorithm 1 Multi-Head Attention (Example 3.1). Input: $\{x^i\}_{i \in [s]} \subset \mathbb{R}^n$, input vectors for a sequence of length $s$. Input: $\{W^{Q,a}, W^{K,a}, W^{V,a}, W^{O,a}\}_{a \in [H]} \subset \mathbb{R}^{n \times n}$, weight matrices for $H$ heads.
Open Source Code	Yes	All simulation codes are available at https://github.com/manasakai/infinite-width-attention.
Open Datasets	No	The paper describes generating data for its simulations rather than using pre-existing open datasets. For instance, in Appendix B.1, it states: "Each element of the initial vector $h \in \mathbb{R}^n$ is sampled independently from a standard normal distribution." No specific external dataset or repository link is provided.
Dataset Splits	No	The paper performs simulations by generating samples rather than using pre-defined dataset splits. In Appendix B.1, it states: "To estimate the empirical distributions of finite-width attention outputs and their corresponding infinite-width limits, we employ Monte Carlo sampling. For each such estimation, 50,000 samples are drawn, unless otherwise noted." It does not describe training, validation, or test splits for a specific dataset.
Hardware Specification	No	Our experiments are small-scale and implementable by a small laptop. Also, we do not pursue the computational cost in this study, so the computational resource is out of our focus.
Software Dependencies	No	The paper does not specify any software dependencies with version numbers, such as Python versions, specific libraries (e.g., PyTorch, TensorFlow), or their corresponding versions.
Experiment Setup	Yes	Unless otherwise noted, the simulations presented in this paper set the spatial dimension to $s = 4$. The core experimental setup follows that described in Example 3.1, which is outlined in Algorithm 1. [...] Each element of the initial vector $h \in \mathbb{R}^n$ is sampled independently from a standard normal distribution. For all weight matrices involved in the attention mechanism $W^{Q,a}, W^{K,a}, W^{V,a}, W^{O,a}$ and the matrices $W^i$ generating $x^i$, we set $\sigma^2_{W^{Q,a}} = \sigma^2_{W^{K,a}} = \sigma^2_{W^{V,a}} = \sigma^2_{W^{O,a}} = \sigma^2_{W^i} = 1$. [...] In our experiments, we set $C = 100$. [...] For each such estimation, 50,000 samples are drawn, unless otherwise noted. Kernel density estimation (KDE) is used to visualize these empirical distributions.
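
The Monte Carlo procedure quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code (which is in the repository linked above). Several details are assumptions that the excerpts here do not confirm: the 1/sqrt(n) scaling of the weights and of the attention scores, the construction x^i = W^i h / sqrt(n), the 1/sqrt(H) normalization over heads, and all function names. The constant C = 100 is not modeled because its role is not described in the quoted text.

```python
import numpy as np
from scipy.stats import gaussian_kde


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def sample_attention_output(n, s=4, num_heads=1, rng=None):
    """Draw one random attention layer of width n and return one output coordinate.

    All scalings below are assumptions made for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Initial vector h with i.i.d. standard normal entries (as quoted from Appendix B.1).
    h = rng.standard_normal(n)
    # Assumed input construction: x^i = W^i h / sqrt(n), with unit-variance entries in W^i.
    W_in = rng.standard_normal((s, n, n))
    X = W_in @ h / np.sqrt(n)  # shape (s, n)

    out = np.zeros(n)
    for _ in range(num_heads):
        # Unit-variance Gaussian weights, rescaled by 1/sqrt(n) (assumed parameterization).
        WQ, WK, WV, WO = rng.standard_normal((4, n, n)) / np.sqrt(n)
        Q, K, V = X @ WQ.T, X @ WK.T, X @ WV.T  # each of shape (s, n)
        scores = Q @ K.T / np.sqrt(n)  # assumed 1/sqrt(n) score scaling, shape (s, s)
        head = (softmax(scores, axis=-1) @ V) @ WO.T  # shape (s, n)
        out += head[0]  # output at the first sequence position
    # Assumed 1/sqrt(H) normalization over heads; return the first coordinate.
    return out[0] / np.sqrt(num_heads)


def monte_carlo_kde(n, num_samples=50_000, seed=0, **kwargs):
    """Collect Monte Carlo samples of one output coordinate and fit a Gaussian KDE."""
    rng = np.random.default_rng(seed)
    samples = np.array(
        [sample_attention_output(n, rng=rng, **kwargs) for _ in range(num_samples)]
    )
    return samples, gaussian_kde(samples)
```

Calling `monte_carlo_kde(n=64, num_samples=2_000)` returns the raw samples together with a fitted density that can be evaluated on a grid for plotting. The default of 50,000 samples mirrors the quoted setup, but a smaller count is more practical for a quick check.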