Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Linear Transformers Implicitly Discover Unified Numerical Algorithms
Authors: Patrick Lutz, Aditya Gangrade, Hadi Daneshmand, Venkatesh Saligrama
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train a linear, attention-only transformer on millions of masked-block completion tasks: each prompt is a masked low-rank matrix whose missing block may be (i) a scalar prediction target or (ii) an unseen kernel slice for Nyström extrapolation. The model sees only input output pairs and a mean-squared loss; it is given no normal equations, no handcrafted iterations, and no hint that the tasks are related. Surprisingly, after training, algebraic unrolling reveals the same parameter-free update rule across all three resource regimes (full visibility, bandwidth-limited heads, rank-limited attention). We prove that this rule achieves second-order convergence on full-batch problems, cuts distributed iteration complexity, and remains accurate with compute-limited attention. |
| Researcher Affiliation | Academia | Patrick Lutz Boston University EMAIL Aditya Gangrade Boston University EMAIL Hadi Daneshmand University of Virginia EMAIL Venkatesh Saligrama Boston University EMAIL |
| Pseudocode | Yes | Algorithm 1: Emergent Algorithm for Global Low-rank Estimation (EAGLE). Left: the numerical kernel UPDATE runs unchanged on every machine. Right: the outer loop calls UPDATE, and in the distributed regime (M>1), it averages the resulting query updates (C ) and outputs (D ). M = 1 in the unconstrained and computation-limited settings. In the former, S = In, while in the latter, S R(d+d ) r is composed of random orthogonal rows. The values η, γ are global constants (see below for their values). |
| Open Source Code | Yes | These authors contributed equally to this work; code is available on github.com. |
| Open Datasets | Yes | We evaluate the unconstrained EAGLE variant on seven low-rank matrices from the Suite Sparse Matrix Collection Davis and Hu (2011) spanning 1100 million coefficients. |
| Dataset Splits | No | The paper describes generating synthetic data and using matrices from the Suite Sparse Matrix Collection, but does not provide specific train/test/validation splits (e.g., percentages or exact counts) for any of these datasets. It mentions training over 20,000 iterations with a batch size of 1024, but not how the overall dataset was split. |
| Hardware Specification | No | The paper mentions 'CUDA solvers' and 'GPU-accelerated baselines' in Section 5.4 but does not specify any particular GPU models (e.g., NVIDIA A100, RTX 2080 Ti), CPU models, or other detailed hardware specifications. |
| Software Dependencies | No | The paper refers to software components like 'torch.linalg.lstsq' (Section 5.1) and 'cupyx.scipy.sparse.linalg' (Section 5.4), but it does not provide specific version numbers for PyTorch, SciPy, or any other libraries or programming languages used. |
| Experiment Setup | Yes | The transformer is trained using the Adam optimizer with a constant learning rate of 0.001 and a batch size of 1024 for 20,000 iterations. To stabilize training, gradient clipping with a 2-norm threshold of 0.1 is applied at each step. |