Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Benign Overfitting in Single-Head Attention
Authors: Roey Magen, Shuning Shang, Zhiwei Xu, Spencer Frei, Wei Hu, Gal Vardi
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we show benign-overfitting results for the attention mechanism. We consider classification with a single-head softmax attention model, and study the conditions that allow for benign overfitting. [...] In Section 6, we complement our theoretical results with an empirical study. We show that sufficiently large SNR and input dimension are necessary and sufficient to achieve benign overfitting. |
| Researcher Affiliation | Academia | Roey Magen (Weizmann Institute of Science), Shuning Shang (Princeton University), Zhiwei Xu (University of Michigan), Spencer Frei (UC Davis), Wei Hu (University of Michigan), Gal Vardi (Weizmann Institute of Science) |
| Pseudocode | No | The paper describes the gradient descent optimization steps ($v_{t+1} = v_t - \beta \nabla_v L(v_t, p_t)$ and $p_{t+1} = p_t - \beta \nabla_p L(v_t, p_t)$) but does not present them in a structured pseudocode or algorithm block. |
| Open Source Code | No | Although the code is not released at the time of submission to preserve anonymity, we plan to release the full implementation, including data generation scripts and instructions for reproducing all figures, upon publication. |
| Open Datasets | Yes | To further validate our theoretical findings, we conducted additional experiments on real-world datasets, including MNIST and CIFAR-10. |
| Dataset Splits | No | Since the experiments are based on synthetic data, there is no train/test split in the traditional sense. The tables for MNIST and CIFAR-10 list "Training Size n" and "Test acc (on clean data)" but do not specify split percentages or how the data was divided for training and testing. |
| Hardware Specification | Yes | Additionally, all experiments were conducted on a single NVIDIA T4 GPU with 16GB memory. |
| Software Dependencies | No | The paper does not explicitly mention any specific software dependencies or their version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | Parameters: d = 900, T = 5, β = 0.015, η = 0.1, test sample size = 2000. (Figure 1 caption). Parameters: n = 200, d = 40000, T = 2, β = 0.025, ρ = 30, η = 0.05, test sample size = 2000. (Figure 2 caption). |
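
The gradient descent updates quoted in the Pseudocode row can be illustrated with a minimal sketch. This is not the authors' implementation, which was not released at submission time. Only the simultaneous updates of v and p with step size β come from the table. The model form f(X) = v · (Xᵀ softmax(X p)), the logistic loss, the zero initialization, and the function names `softmax`, `loss_and_grads`, and `train` are all assumptions made for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def loss_and_grads(Xs, ys, v, p):
    """Mean logistic loss and its gradients with respect to v and p.

    Xs: (n, T, d) array of token sequences, ys: (n,) labels in {-1, +1}.
    Assumed model (not stated in the table): f(X) = v . (X^T softmax(X p)).
    """
    n = len(ys)
    loss, grad_v, grad_p = 0.0, np.zeros_like(v), np.zeros_like(p)
    for X, y in zip(Xs, ys):
        s = softmax(X @ p)            # attention weights over the T tokens
        r = X.T @ s                   # attention-weighted token average, shape (d,)
        margin = y * (v @ r)
        loss += np.logaddexp(0.0, -margin)          # log(1 + exp(-margin))
        g = -y * 0.5 * (1.0 - np.tanh(margin / 2))  # d loss / d f, via a stable sigmoid(-margin)
        grad_v += g * r
        u = X @ v                     # per-token scores x_t . v
        grad_p += g * (X.T @ (s * (u - s @ u)))     # chain rule through the softmax
    return loss / n, grad_v / n, grad_p / n


def train(Xs, ys, beta, steps):
    """Run the two updates v <- v - beta * grad_v and p <- p - beta * grad_p."""
    d = Xs.shape[-1]
    v, p = np.zeros(d), np.zeros(d)   # zero initialization is an assumption
    losses = []
    for _ in range(steps):
        loss, grad_v, grad_p = loss_and_grads(Xs, ys, v, p)
        losses.append(loss)           # loss measured before this step's update
        v, p = v - beta * grad_v, p - beta * grad_p
    return v, p, losses
```

With data shaped `(n, T, d)`, a call such as `train(Xs, ys, beta=0.015, steps=100)` uses the step size from the Figure 1 caption. The number of steps is hypothetical, since the table does not report it.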