Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Bayes optimal learning of attention-indexed models

Authors: Fabrizio Boncoraglio, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce the attention-indexed model (AIM), a theoretical framework for analyzing learning in deep attention layers... We propose a matching approximate message passing algorithm and show that gradient descent can reach optimal performance. AIM offers a solvable playground for understanding learning in self-attention layers, that are key components of modern architectures. The NeurIPS Paper Checklist also includes sections for "Experimental result reproducibility", "Open access to data and code", "Experimental setting/details", "Experiment statistical significance", and "Experiments compute resources", indicating the presence of empirical studies.
Researcher Affiliation | Academia | Fabrizio Boncoraglio, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová: Statistical Physics of Computation Laboratory, EPFL, Switzerland (all four authors)
Pseudocode | Yes | Algorithm 1: AMP
Open Source Code | Yes | The code used to produce all the figures and the experiments is available at https://github.com/SPOC-group/ExtensiveRankAttention.
Open Datasets | Yes | We consider a dataset $\mathcal{D} = \{y^\mu, X_0^\mu\}$ of $n$ samples indexed by $\mu$, where $X_0^\mu = \{x_a^\mu\}_{a=1}^{T} \in \mathbb{R}^{T \times d}$ has rows $x_a^\mu$. Each sample consists of the embeddings of $T$ tokens $x_a^\mu \in \mathbb{R}^d$, taken as standard Gaussian $x_a^\mu \sim \mathcal{N}(0, I_d)$... Our work is theoretical on synthetic data... We use synthetic data that we generate ourselves... We release the code and the data for reproducing the figures.
Dataset Splits | Yes | We consider a dataset $\mathcal{D} = \{y^\mu, X_0^\mu\}$ of $n$ samples indexed by $\mu$... The task is then to optimally estimate either the weights $S$ (estimation task)... or the label associated to a new input sample $X_{\text{new}}$ (generalization task)... NeurIPS Paper Checklist - Experimental setting/details: Yes, in the supplementary material
Hardware Specification | No | The paper does not explicitly specify hardware details such as GPU/CPU models, processors, or memory in the main text or supplementary material (Appendix D.8), despite the NeurIPS checklist stating that such information is in the supplementary material.
Software Dependencies | Yes | Our gradient descent experiments are done in PyTorch 1.12.1 by minimizing the following loss using Adam
Experiment Setup | Yes | We choose a learning rate 0.1 and keep the other hyperparameters at their default parameters and initializing the weights as a standard Gaussian.
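
The Open Datasets, Software Dependencies, and Experiment Setup rows together describe a small recipe: standard Gaussian token embeddings, standard Gaussian weight initialization, and Adam with learning rate 0.1 and otherwise default hyperparameters. The sketch below illustrates that recipe under stated assumptions and is not the authors' code. It uses only the Python standard library rather than PyTorch, so the Adam loop is written by hand with the values PyTorch documents as defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8). The quoted text does not reproduce the paper's loss or label function, so a placeholder quadratic loss is used instead; the names sample_tokens, adam_minimize, and demo are hypothetical.

```python
import math
import random


def sample_tokens(n, T, d, seed=0):
    """Return n samples, each a T x d nested list of i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
            for _ in range(n)]


def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
                  steps=400):
    """Minimal Adam loop; beta1, beta2 and eps are the PyTorch defaults."""
    w = list(w0)
    m = [0.0] * len(w)  # running first-moment estimate
    v = [0.0] * len(w)  # running second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1.0 - beta2 ** t)  # bias-corrected second moment
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def demo(d=4, seed=1):
    """Run Adam on a placeholder loss; returns (learned weights, target)."""
    rng = random.Random(seed)
    # Hypothetical target vector, used only to define the placeholder loss.
    target = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # Weights initialized as standard Gaussian, as in the Experiment Setup row.
    w0 = [rng.gauss(0.0, 1.0) for _ in range(d)]

    # Gradient of the placeholder loss 0.5 * ||w - target||^2.
    # This is NOT the loss minimized in the paper.
    def grad(w):
        return [wi - ti for wi, ti in zip(w, target)]

    return adam_minimize(grad, w0), target


if __name__ == "__main__":
    X = sample_tokens(n=3, T=2, d=5)
    print(len(X), len(X[0]), len(X[0][0]))
    w, target = demo()
    print(max(abs(a - b) for a, b in zip(w, target)))
```

In PyTorch the optimizer configuration would correspond to torch.optim.Adam(params, lr=0.1) with every other argument left at its default. Reproducing the reported experiments would additionally require the loss and label model from the paper and its released code.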