Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Understanding Softmax Attention Layers: Exact Mean-Field Analysis on a Toy Problem
Authors: Elvis Dohmatob
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Our analysis yields exact analytic expressions for the population risk in terms of the overlaps between the learned model parameters and those of an oracle. Moreover, we derive a detailed description of the gradient descent dynamics for these overlaps and prove that, under broad conditions, the dynamics converge to the unique oracle attractor. Our work not only advances the understanding of self-attention but also provides key theoretical ideas that are likely to find use in further analyses of even more complex transformer architectures. |
| Researcher Affiliation | Collaboration | ¹Concordia University, ²Mila Quebec AI Institute, ³Meta |
| Pseudocode | No | The paper focuses on mathematical derivations, propositions, and proofs. It does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository. The NeurIPS checklist also states: "This is purely a theoretical work. Question doesn't apply." |
| Open Datasets | No | The paper introduces and analyzes a "single-location regression problem" with synthetic data generated from a defined data distribution P (e.g., Gaussian noise). It does not use any publicly available datasets. The NeurIPS checklist also states: "This is purely a theoretical work. Question doesn't apply." |
| Dataset Splits | No | The paper studies a theoretical problem with a defined data distribution P. It does not use empirical datasets that would require explicit training/test/validation splits. The NeurIPS checklist states: "This is purely a theoretical work. Question doesn't apply." |
| Hardware Specification | Yes | Experiments were run with a single CPU on a laptop, and took less than 30 minutes in total. |
| Software Dependencies | No | The paper does not mention any specific software dependencies or their version numbers for the presented theoretical analysis or the illustrative simulations. |
| Experiment Setup | Yes | For this experiment, we use input-dimension $d = 100$, $L = 20$ blocks, (normalized) inverse-temperature $\beta = 1$, $\gamma = 1/2$, and label-noise level $\epsilon = 0.1$. The Riemannian gradient-descent scheme is used (29) with step-size $s = 0.01$. The population risk $R$ is replaced by an empirical version $\hat{R} = n^{-1} \sum_{i=1}^{n} (f(X_i; u, v) - y_i)^2$, where $(X_1, y_1), \ldots, (X_n, y_n)$ is an iid sample of size $n = 1000$ from the data distribution $P$. The final risk $R(u_k, v_k)$ shown is evaluated on an independent test sample of size 10000. |
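
The Experiment Setup row describes a common pattern: minimize an empirical squared-error risk over a training sample with a norm-constrained gradient scheme, then report the risk on an independent test sample. The following is a minimal sketch of that pattern only. It does not reproduce the paper's experiment. The linear model, the Gaussian data generator, the project-then-renormalize update in `sphere_step`, and all function names are assumptions made for illustration. The attention model $f(X; u, v)$, the data distribution $P$, and scheme (29) are not specified on this page and are not implemented here. Only the step size, noise level, and sample sizes are taken from the row above.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def empirical_risk(predict, sample):
    """Mean squared error of predict(x) against y over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in sample) / len(sample)


def sphere_step(u, grad, step_size):
    """One gradient step constrained to the unit sphere.

    Assumption: a generic project-then-renormalize update, used as a
    stand-in; it is not claimed to match scheme (29) of the paper.
    """
    inner = dot(u, grad)
    # Project the Euclidean gradient onto the tangent space at u.
    tangent = [g - inner * ui for ui, g in zip(u, grad)]
    # Step against the tangent gradient, then retract onto the sphere.
    return normalize([ui - step_size * t for ui, t in zip(u, tangent)])


def draw_sample(u_star, size, noise, rng):
    """Toy data (assumption): x ~ N(0, I), y = <u_star, x> + noise * N(0, 1)."""
    d = len(u_star)
    sample = []
    for _ in range(size):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        sample.append((x, dot(u_star, x) + noise * rng.gauss(0.0, 1.0)))
    return sample


def run_toy(d=10, n=1000, n_test=10000, steps=200,
            step_size=0.01, noise=0.1, seed=0):
    """Fit a unit-norm linear predictor; return it with test risks before/after."""
    rng = random.Random(seed)
    u_star = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
    u = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
    train = draw_sample(u_star, n, noise, rng)
    test = draw_sample(u_star, n_test, noise, rng)

    initial_risk = empirical_risk(lambda x: dot(u, x), test)
    for _ in range(steps):
        # Gradient of the empirical squared loss for the linear toy model.
        grad = [0.0] * d
        for x, y in train:
            residual = 2.0 * (dot(u, x) - y) / n
            for j in range(d):
                grad[j] += residual * x[j]
        u = sphere_step(u, grad, step_size)
    final_risk = empirical_risk(lambda x: dot(u, x), test)
    return u, initial_risk, final_risk
```

Calling `run_toy()` returns the fitted unit vector together with the test risk before and after training. Keeping the training and test samples separate mirrors the setup described in the table, where the reported risk is evaluated on an independent sample rather than on the data used for optimization.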