Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Learning Linear Attention in Polynomial Time
Authors: Morris Yau, Ekin Akyürek, Jiayuan Mao, Josh Tenenbaum, Stefanie Jegelka, Jacob Andreas
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experimental section, we validate our theoretical findings. In Section 4.1, we train multiple models using stochastic gradient descent on a dataset generated by a single linear attention network's output. Our results demonstrate that multi-head linear attention outperforms both single-layer linear attention and multi-layer linear attention, achieving comparable results to our Algorithm 1. In Section 4.2, we show that our proposed certificate directly correlates with generalization error even for models trained using stochastic gradient descent. |
| Researcher Affiliation | Academia | Morris Yau MIT CSAIL EMAIL Ekin Akyürek MIT CSAIL EMAIL Jiayuan Mao MIT CSAIL EMAIL Joshua B. Tenenbaum MIT Brain and Cognitive Sciences EMAIL Stefanie Jegelka TUM Munich, MCML, MIT CSAIL EMAIL Jacob Andreas MIT CSAIL EMAIL |
| Pseudocode | Yes | Algorithm 1 MHLA Learning via Regression; Algorithm 2 Constructing Features for Certificates of Identifiability |
| Open Source Code | Yes | We will release code. We provide code. |
| Open Datasets | Yes | Associative Memory [Bietti et al., 2023, Cabannes et al., 2024] is a task of looking up a value in a table with a query. |
| Dataset Splits | No | Next, achieving zero training and validation loss does not by itself certify that a model has learned a target computation well enough to generalize out of distribution. We generate two datasets, one that has identifiable λmin(ΛD) > 0 and one that is nonidentifiable with λmin(ΛD) = 0. No explicit numerical splits or references to standard splits with citations are provided in the main text. |
| Hardware Specification | No | We have a dedicated section in Appendix A. (From the NeurIPS Paper Checklist, item 8.) However, Appendix A covers 'Certificate for identifiability of linear attention' and does not contain hardware specifications. The paper does not provide specific hardware details elsewhere. |
| Software Dependencies | No | We use Adam (Kingma and Ba [2014]) optimizer to train linear attention model Equation (4) and the full Transformer (Vaswani et al. [2017]) models. Optimizer: AdamW (Loshchilov and Hutter [2018]). The paper mentions software by name and citation but does not provide specific version numbers for libraries or frameworks used (e.g., PyTorch 1.9, Python 3.x). |
| Experiment Setup | Yes | Hyperparameter search space: d (input dimension) [2, 4, 8, 16]; m (number of heads) [1, 2, 4, 8, 16]; n (number of layers) [1, 2, 4]; learning rate [0.01, 0.001]; batch size [32, 64]; optimizer AdamW (Loshchilov and Hutter [2018]). Hyperparameter search space: d (input dimension) [2048]; m (number of heads) [16]; n (number of layers) [4]; learning rate [0.00025]; epochs 100; optimizer AdamW (Loshchilov and Hutter [2018]). |
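
To give a sense of the size of the first search space listed in the Experiment Setup row, the sketch below enumerates it as a full Cartesian grid. This is an illustration only: the report does not say whether the authors ran every combination, and the dictionary keys and the `enumerate_configs` helper are hypothetical names, not taken from the paper's code.

```python
from itertools import product

# Values copied from the "Experiment Setup" row of the table.
# The key names are illustrative; the paper only gives d, m, n and plain labels.
SEARCH_SPACE = {
    "input_dim": [2, 4, 8, 16],       # d
    "num_heads": [1, 2, 4, 8, 16],    # m
    "num_layers": [1, 2, 4],          # n
    "learning_rate": [0.01, 0.001],
    "batch_size": [32, 64],
}


def enumerate_configs(space):
    """Return every combination in the space as a list of dicts.

    Assumes a full Cartesian grid, which the source does not confirm.
    """
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = enumerate_configs(SEARCH_SPACE)
print(len(configs))  # 4 * 5 * 3 * 2 * 2 = 240
```

Under the full-grid assumption this yields 240 configurations. The second search space in the same row has a single value per setting, so it corresponds to one configuration.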