Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Linear Transformers Implicitly Discover Unified Numerical Algorithms

Authors: Patrick Lutz, Aditya Gangrade, Hadi Daneshmand, Venkatesh Saligrama

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We train a linear, attention-only transformer on millions of masked-block completion tasks: each prompt is a masked low-rank matrix whose missing block may be (i) a scalar prediction target or (ii) an unseen kernel slice for Nyström extrapolation. The model sees only input output pairs and a mean-squared loss; it is given no normal equations, no handcrafted iterations, and no hint that the tasks are related. Surprisingly, after training, algebraic unrolling reveals the same parameter-free update rule across all three resource regimes (full visibility, bandwidth-limited heads, rank-limited attention). We prove that this rule achieves second-order convergence on full-batch problems, cuts distributed iteration complexity, and remains accurate with compute-limited attention.
Researcher Affiliation Academia Patrick Lutz Boston University EMAIL Aditya Gangrade Boston University EMAIL Hadi Daneshmand University of Virginia EMAIL Venkatesh Saligrama Boston University EMAIL
Pseudocode Yes Algorithm 1: Emergent Algorithm for Global Low-rank Estimation (EAGLE). Left: the numerical kernel UPDATE runs unchanged on every machine. Right: the outer loop calls UPDATE, and in the distributed regime (M>1), it averages the resulting query updates (C ) and outputs (D ). M = 1 in the unconstrained and computation-limited settings. In the former, S = In, while in the latter, S R(d+d ) r is composed of random orthogonal rows. The values η, γ are global constants (see below for their values).
Open Source Code Yes These authors contributed equally to this work; code is available on github.com.
Open Datasets Yes We evaluate the unconstrained EAGLE variant on seven low-rank matrices from the Suite Sparse Matrix Collection Davis and Hu (2011) spanning 1100 million coefficients.
Dataset Splits No The paper describes generating synthetic data and using matrices from the Suite Sparse Matrix Collection, but does not provide specific train/test/validation splits (e.g., percentages or exact counts) for any of these datasets. It mentions training over 20,000 iterations with a batch size of 1024, but not how the overall dataset was split.
Hardware Specification No The paper mentions 'CUDA solvers' and 'GPU-accelerated baselines' in Section 5.4 but does not specify any particular GPU models (e.g., NVIDIA A100, RTX 2080 Ti), CPU models, or other detailed hardware specifications.
Software Dependencies No The paper refers to software components like 'torch.linalg.lstsq' (Section 5.1) and 'cupyx.scipy.sparse.linalg' (Section 5.4), but it does not provide specific version numbers for PyTorch, SciPy, or any other libraries or programming languages used.
Experiment Setup Yes The transformer is trained using the Adam optimizer with a constant learning rate of 0.001 and a batch size of 1024 for 20,000 iterations. To stabilize training, gradient clipping with a 2-norm threshold of 0.1 is applied at each step.