Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Benign Overfitting in Single-Head Attention

Authors: Roey Magen, Shuning Shang, Zhiwei Xu, Spencer Frei, Wei Hu, Gal Vardi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this work, we show benign-overfitting results for the attention mechanism. We consider classification with a single-head softmax attention model, and study the conditions that allow for benign overfitting. [...] In Section 6, we complement our theoretical results with an empirical study. We show that sufficiently large SNR and input dimension are necessary and sufficient to achieve benign overfitting.
Researcher Affiliation	Academia	Weizmann Institute of Science EMAIL Shuning Shang Princeton University EMAIL Zhiwei Xu University of Michigan EMAIL Spencer Frei UC Davis EMAIL University of Michigan EMAIL Weizmann Institute of Science EMAIL
Pseudocode	No	The paper describes the gradient descent optimization steps (v_{t+1} = v_t − β ∇_v L(v_t, p_t) and p_{t+1} = p_t − β ∇_p L(v_t, p_t)) but does not present them in a structured pseudocode or algorithm block.
Open Source Code	No	Although the code is not released at the time of submission to preserve anonymity, we plan to release the full implementation, including data generation scripts and instructions for reproducing all figures, upon publication.
Open Datasets	Yes	To further validate our theoretical findings, we conducted additional experiments on real-world datasets, including MNIST and CIFAR-10.
Dataset Splits	No	Since the experiments are based on synthetic data, there is no train/test split in the traditional sense. The tables for MNIST and CIFAR-10 list "Training Size n" and "Test acc (on clean data)" but do not specify split percentages or how the data was divided for training and testing.
Hardware Specification	Yes	Additionally, all experiments were conducted on a single NVIDIA T4 GPU with 16GB memory.
Software Dependencies	No	The paper does not explicitly mention any specific software dependencies or their version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup	Yes	Parameters: d = 900, T = 5, β = 0.015, η = 0.1, test sample size = 2000. (Figure 1 caption). Parameters: n = 200, d = 40000, T = 2, β = 0.025, ρ = 30, η = 0.05, test sample size = 2000. (Figure 2 caption).
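
Because the paper gives the gradient descent updates only in prose and its code is not yet released, the following is a minimal sketch of what one simultaneous update of v and p with step size β could look like. It is not the authors' implementation. Everything beyond the two update rules is an assumption made for illustration: the model form f(X; v, p) = Σ_j softmax(Xp)_j ⟨x_j, v⟩, the logistic loss, the zero initialization, and the helper names (loss_and_grads, gd_step, make_toy_data, train). The toy data generator is hypothetical and does not follow the paper's data model. Only T = 5 and β = 0.015 are taken from the Figure 1 caption; n = 10 and d = 20 are small stand-in values so the pure-Python loop runs quickly.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def loss_and_grads(data, v, p):
    """Mean logistic loss and its gradients with respect to v and p.

    Assumed model: f(X; v, p) = sum_j softmax(X p)_j * <x_j, v>.
    """
    d, n = len(v), len(data)
    loss, grad_v, grad_p = 0.0, [0.0] * d, [0.0] * d
    for X, y in data:
        attn = softmax([dot(x, p) for x in X])
        vals = [dot(x, v) for x in X]
        f = sum(a * g for a, g in zip(attn, vals))
        margin = y * f
        # log(1 + exp(-margin)), written to avoid overflow
        loss += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
        dl_df = -y * sigmoid(-margin)
        for a, g, x in zip(attn, vals, X):
            cv = dl_df * a            # df/dv = sum_j a_j x_j
            cp = dl_df * a * (g - f)  # df/dp = sum_j a_j (g_j - f) x_j
            for i in range(d):
                grad_v[i] += cv * x[i]
                grad_p[i] += cp * x[i]
    return loss / n, [g / n for g in grad_v], [g / n for g in grad_p]


def gd_step(data, v, p, beta):
    """One simultaneous update: both gradients are taken at (v_t, p_t)."""
    loss, grad_v, grad_p = loss_and_grads(data, v, p)
    v_next = [vi - beta * g for vi, g in zip(v, grad_v)]
    p_next = [pi - beta * g for pi, g in zip(p, grad_p)]
    return v_next, p_next, loss


def make_toy_data(n, T, d, seed=0):
    """Hypothetical toy data, not the paper's data model."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
        X[0][0] += 3.0 * y  # label-dependent signal in the first token
        data.append((X, y))
    return data


def train(data, d, beta, steps):
    """Run gradient descent from zero initialization (an assumption)."""
    v, p, losses = [0.0] * d, [0.0] * d, []
    for _ in range(steps):
        v, p, loss = gd_step(data, v, p, beta)
        losses.append(loss)  # loss measured before each update
    return v, p, losses


if __name__ == "__main__":
    toy = make_toy_data(n=10, T=5, d=20)
    _, _, history = train(toy, d=20, beta=0.015, steps=100)
    print("first loss:", history[0], "last loss:", history[-1])
```

Note that gd_step computes both gradients at the current iterate before changing either parameter, which is what the two update rules describe. An alternating scheme, where p is updated using the already-updated v, would be a different algorithm.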