Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Transformers are almost optimal metalearners for linear classification

Authors: Roey Magen, Gal Vardi

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our theoretical results with an empirical study on metalearning with linear attention. We trained linear attention models (Eq. (3)) on data generated as specified in Section 2.1 using GD with a fixed step size and the logistic loss function. In Figure 1, we compare the in-context sample complexity of linear attention against three baseline algorithms: (i) Support Vector Machines (see Section 15 in Shalev-Shwartz and Ben-David [45]); (ii) The maximum likelihood estimator (MLE), which estimates $\mu$ under a Gaussian prior by averaging the in-context examples (see Example 9.11 in [44]), and then uses this estimate for prediction; (iii) MLE with access to the ground-truth matrix P, which first projects the data using P, and only then applies MLE. We see that the linear transformer outperforms both SVM and MLE, which lack access to P, and nearly matches the performance of the MLE with projection. Additional experiments and details are provided in the appendix. Figure 1: Test accuracy versus the number of in-context examples M, where each plot represents a different signal strength $R = \tilde{R}$. We compare the performance of the trained linear transformer model against three baselines: full MLE, projected MLE (with access to the true subspace), and SVM. The transformer closely approaches the performance of the projected MLE and outperforms both the full MLE and SVM, which lack access to the subspace. Accuracy improves as the signal strength $R = \tilde{R}$ increases. d = 500, k = 30, B = 20000.
Researcher Affiliation | Academia | Roey Magen Weizmann Institute of Science EMAIL Gal Vardi Weizmann Institute of Science EMAIL
Pseudocode | No | The paper describes mathematical definitions and theoretical proofs but does not contain a distinct pseudocode block or algorithm section. It refers to prediction functions and gradient descent steps as mathematical equations (e.g., Eq. 3, $W_{t+1} = W_t - \alpha \nabla L(W_t)$) rather than structured algorithmic steps.
Open Source Code | No | Answer: [No] Justification: We plan to release the code in the future.
Open Datasets | No | We consider a natural family of tasks where each task corresponds to a class-conditional Gaussian mixture model, with the mean vectors lying in a shared k-dimensional subspace of $\mathbb{R}^d$. After training on a sufficient number of such tasks, we show that the transformer can generalize to a new task using only $\tilde{O}(k/\tilde{R}^4)$ in-context examples, where $\tilde{R}$ denotes the signal strength at test time. ... We consider the following metadistribution during training: Assumption 2.1 (training-time task distribution). ... Assumption 2.2 (test-time task distribution). ... We trained linear attention models (Eq. (3)) on data generated as specified in Section 2.1
Dataset Splits | No | The paper describes data generation and training/test tasks, but not specific dataset splits in terms of percentages or counts of a fixed dataset. It defines a meta-distribution from which 'B datasets' (tasks) are drawn for training and 'M in-context labeled samples' for a new task at test time. This is a task-based setup, not traditional train/test/validation splits for hyperparameter tuning or final evaluation of a static dataset.
Hardware Specification | Yes | All computations can be completed within an hour on a CPU. ... All the experiments used a CPU.
Software Dependencies | No | implemented in PyTorch.
Experiment Setup | Yes | We trained linear attention models (Eq. (3)) on data generated as specified in Section 2.1 using GD with a fixed step size $\alpha = 0.01$, N = 40 and the logistic loss function. Training was performed for 200 300 steps from a zero initialization, implemented in PyTorch.
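
Since the code is not yet released, the setup quoted above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions not stated on this page: the linear attention predictor is taken to be sign(mean_i(y_i x_i)^T W x_query), which may differ from the paper's Eq. (3); the noise has identity covariance; labels are uniform on {-1, +1}; and the dimensions are much smaller than the reported d = 500, k = 30, B = 20000. NumPy is used in place of PyTorch, and all function names are hypothetical.

```python
import numpy as np

def sample_task(rng, P, R, M):
    """Draw one task: M labeled context points plus one labeled query."""
    g = rng.standard_normal(P.shape[1])
    mu = R * (P @ g) / np.linalg.norm(g)   # mean in span(P) with norm R
    y = rng.choice([-1.0, 1.0], size=M + 1)
    # Assumption: identity-covariance noise around the class mean y * mu.
    X = y[:, None] * mu + rng.standard_normal((M + 1, P.shape[0]))
    return X[:M], y[:M], X[M], y[M]

def context_mean(Xc, yc):
    """Average of y_i * x_i over the context, used as the estimate of mu."""
    return (yc[:, None] * Xc).mean(axis=0)

def mle_predict(Xc, yc, xq, P=None):
    """MLE baseline; if P is given, project onto span(P) first."""
    mu_hat = context_mean(Xc, yc)
    if P is not None:
        mu_hat = P @ (P.T @ mu_hat)
    return np.sign(mu_hat @ xq)

def train_linear_attention(tasks, d, alpha=0.01, steps=200):
    """Full-batch GD on the logistic loss, starting from W = 0."""
    U = np.stack([context_mean(Xc, yc) for Xc, yc, _, _ in tasks])
    Q = np.stack([xq for _, _, xq, _ in tasks])
    t = np.array([yq for _, _, _, yq in tasks])
    W = np.zeros((d, d))
    for _ in range(steps):
        margins = t * ((U @ W) * Q).sum(axis=1)
        # d/dz log(1 + exp(-z)) = -sigmoid(-z), computed stably via logaddexp.
        weights = -t * np.exp(-np.logaddexp(0.0, margins))
        grad = (U * weights[:, None]).T @ Q / len(t)
        W -= alpha * grad
    return W

def run_demo(seed=0, d=20, k=2, R=3.0, M=10, B=2000, n_test=500):
    """Train on B tasks, then report test accuracy for each predictor."""
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis of the subspace
    W = train_linear_attention([sample_task(rng, P, R, M) for _ in range(B)], d)
    hits = {"linear_attention": 0, "mle": 0, "projected_mle": 0}
    for _ in range(n_test):
        Xc, yc, xq, yq = sample_task(rng, P, R, M)
        hits["linear_attention"] += np.sign(context_mean(Xc, yc) @ W @ xq) == yq
        hits["mle"] += mle_predict(Xc, yc, xq) == yq
        hits["projected_mle"] += mle_predict(Xc, yc, xq, P) == yq
    return {name: count / n_test for name, count in hits.items()}

if __name__ == "__main__":
    print(run_demo())
```

Calling `run_demo()` returns a dictionary of test accuracies for the three predictors. The SVM baseline from Figure 1 is omitted to keep the sketch dependency-free.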