Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Attention with Trained Embeddings Provably Selects Important Tokens

Authors: Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory. (Section 5, "Numerical experiments") To support our theoretical findings, we showcase the correlation of the embeddings with the cls embedding p and the output vector v, having trained all the parameters with gradient descent until convergence. We consider different datasets (synthetic data in Figure 2; IMDB/Yelp datasets in Figures 1 and 3) and different architectures (one-layer model (1) in Figures 2 and 3; two-layer model (18) in Figure 1).
Researcher Affiliation	Academia	Diyuan Wu (ISTA), Aleksandr Shevchenko (ETH Zürich), Samet Oymak (University of Michigan), Marco Mondelli (ISTA)
Pseudocode	No	The paper primarily presents theoretical analysis, lemmas, theorems, and proofs. It does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it feature structured steps formatted like code within the main text or appendices.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We do not find it necessary to release the code. Our experiments concern the training of rather standard architectures and they can be readily reproduced without needing to upload the code.
Open Datasets	Yes	IMDB and Yelp datasets. The IMDB dataset (https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) consists of 50000 reviews of average length 239 words per review, associated to either a positive or a negative sentiment. Yelp reviews (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset) provide a much larger selection.
Dataset Splits	No	The paper describes data preprocessing steps such as subsampling the Yelp dataset based on character length and removing neutral reviews, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or exact counts) needed for reproduction.
Hardware Specification	No	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [No] Justification: Our experiments are on shallow attention layers, and they do not require a large amount of computational resources.
Software Dependencies	No	For all numerical simulations, we use the AdamW optimizer from torch.optim, and we reduce the learning rate in a multiplicative fashion by a factor γ = 0.1 at epochs 100 and 200, i.e., LR_new = LR_old · γ. (This mentions 'torch.optim' but does not provide a specific version number for PyTorch or any other software dependency.)
Experiment Setup	Yes	For all numerical simulations, we use the AdamW optimizer from torch.optim, and we reduce the learning rate in a multiplicative fashion by a factor γ = 0.1 at epochs 100 and 200, i.e., LR_new = LR_old · γ. We adhere to the batch size of 128 and fix the embedding dimension to 2048. IMDB and Yelp datasets: The hyperparameters do not differ between the two-layer model and the one-layer model. We set the number of training epochs to 500, the learning rate to 0.01, and the weight decay to 10^-8. Synthetic data: We set the number of training epochs to 196, the learning rate to 10^-4, and the weight decay to 10^-4.
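
The learning-rate schedule quoted in the Experiment Setup row can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code (which was not released). The names TrainConfig, REAL_DATA, SYNTHETIC, and lr_at_epoch are hypothetical. Epochs are assumed to be 0-indexed, with the drop taking effect from the milestone epoch onward. The exponents for weight decay and the synthetic-data learning rate are read as negative (10^-8, 10^-4, 10^-4), since the extracted text lost the signs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the Experiment Setup row (names are hypothetical)."""
    epochs: int
    base_lr: float
    weight_decay: float          # exponent sign assumed negative (lost in extraction)
    batch_size: int = 128
    embed_dim: int = 2048
    milestones: tuple = (100, 200)  # epochs at which the learning rate is reduced
    gamma: float = 0.1              # multiplicative decay factor


# Settings quoted for the IMDB/Yelp runs and for the synthetic-data runs.
REAL_DATA = TrainConfig(epochs=500, base_lr=1e-2, weight_decay=1e-8)
SYNTHETIC = TrainConfig(epochs=196, base_lr=1e-4, weight_decay=1e-4)


def lr_at_epoch(cfg: TrainConfig, epoch: int) -> float:
    """Learning rate in effect during `epoch`, applying LR_new = LR_old * gamma
    once for every milestone that has been reached (assumed 0-indexed epochs)."""
    drops = sum(1 for m in cfg.milestones if epoch >= m)
    return cfg.base_lr * cfg.gamma ** drops


if __name__ == "__main__":
    for e in (0, 99, 100, 199, 200, 499):
        print(e, lr_at_epoch(REAL_DATA, e))
```

In PyTorch, a schedule of this shape is commonly expressed with torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 200], gamma=0.1) on top of torch.optim.AdamW; the paper does not state which scheduler class was used. Note that with 196 epochs, the synthetic-data runs would only ever reach the first milestone.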