Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning Linear Attention in Polynomial Time

Authors: Morris Yau, Ekin Akyürek, Jiayuan Mao, Josh Tenenbaum, Stefanie Jegelka, Jacob Andreas

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In the experimental section, we validate our theoretical findings. In Section 4.1, we train multiple models using stochastic gradient descent on a dataset generated by a single linear attention network's output. Our results demonstrate that multi-head linear attention outperforms both single-layer linear attention and multi-layer linear attention, achieving comparable results to our Algorithm 1. In Section 4.2, we show that our proposed certificate directly correlates with generalization error even for models trained using stochastic gradient descent.
Researcher Affiliation	Academia	Morris Yau MIT CSAIL EMAIL Ekin Akyürek MIT CSAIL EMAIL Jiayuan Mao MIT CSAIL EMAIL Joshua B. Tenenbaum MIT Brain and Cognitive Sciences EMAIL Stefanie Jegelka TUM Munich, MCML, MIT CSAIL EMAIL Jacob Andreas MIT CSAIL EMAIL
Pseudocode	Yes	Algorithm 1 MHLA Learning via Regression; Algorithm 2 Constructing Features for Certificates of Identifiability
Open Source Code	Yes	We will release code. We provide code.
Open Datasets	Yes	Associative Memory [Bietti et al., 2023, Cabannes et al., 2024] is a task of looking up a value in a table with a query.
Dataset Splits	No	Next, achieving zero training and validation loss does not by itself certify that a model has learned a target computation well enough to generalize out of distribution. We generate two datasets, one that has identifiable λmin(ΛD) > 0 and one that is nonidentifiable with λmin(ΛD) = 0. No explicit numerical splits or references to standard splits with citations are provided in the main text.
Hardware Specification	No	The NeurIPS Paper Checklist (item 8) states "We have a dedicated section in Appendix A." However, Appendix A covers 'Certificate for identifiability of linear attention' and does not contain hardware specifications. The paper does not provide specific hardware details elsewhere.
Software Dependencies	No	We use Adam (Kingma and Ba [2014]) optimizer to train linear attention model Equation (4) and the full Transformer (Vaswani et al. [2017]) models. optimizer AdamW (Loshchilov and Hutter [2018]). The paper mentions software by name and citation but does not provide specific version numbers for libraries or frameworks used (e.g., PyTorch 1.9, Python 3.x).
Experiment Setup	Yes	Hyperparameter search space: d (input dimension) [2, 4, 8, 16], m (number of heads) [1, 2, 4, 8, 16], n (number of layers) [1, 2, 4], learning rate [0.01, 0.001], batch size [32, 64], optimizer AdamW (Loshchilov and Hutter [2018]); hyperparameter search space: d (input dimension) [2048], m (number of heads) [16], n (number of layers) [4], learning rate [0.00025], epochs 100, optimizer AdamW (Loshchilov and Hutter [2018]).
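
To give a sense of the size of the first search space listed in the Experiment Setup row, the sketch below enumerates every combination of its values. It is an illustration only: the entry does not say whether the authors searched the grid exhaustively or sampled from it, so the full Cartesian product is an assumption, and the names `SEARCH_SPACE` and `enumerate_configs` are hypothetical.

```python
from itertools import product

# Values copied from the first search space in the Experiment Setup row.
# The dictionary keys are hypothetical names chosen for this sketch.
SEARCH_SPACE = {
    "d": [2, 4, 8, 16],            # input dimension
    "m": [1, 2, 4, 8, 16],         # number of heads
    "n": [1, 2, 4],                # number of layers
    "learning_rate": [0.01, 0.001],
    "batch_size": [32, 64],
}


def enumerate_configs(space):
    """Return one dict per combination of values (assumes a full grid)."""
    keys = list(space)
    value_lists = [space[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


configs = enumerate_configs(SEARCH_SPACE)
print(len(configs))  # 4 * 5 * 3 * 2 * 2 = 240 combinations
print(configs[0])
```

Under the full-grid assumption, each entry in `configs` would correspond to one training run with the AdamW optimizer named in the row.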