Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Attention with Trained Embeddings Provably Selects Important Tokens

Authors: Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory. 5 Numerical experiments To support our theoretical findings, we showcase the correlation of the embeddings with the cls embedding p and the output vector v, having trained all the parameters with gradient descent until convergence. We consider different datasets (synthetic data in Figure 2; IMDB/Yelp datasets in Figures 1 and 3) and different architectures (one-layer model (1) in Figures 2 and 3; two-layer model (18) in Figure 1).
Researcher Affiliation Academia Diyuan Wu (1), Aleksandr Shevchenko (2), Samet Oymak (3), Marco Mondelli (1). Affiliations: (1) ISTA, (2) ETH Zürich, (3) University of Michigan.
Pseudocode No The paper primarily presents theoretical analysis, lemmas, theorems, and proofs. It does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it feature structured steps formatted like code within the main text or appendices.
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We do not find it necessary to release the code. Our experiments concern the training of rather standard architectures and they can be readily reproduced without needing to upload the code.
Open Datasets Yes IMDB and Yelp datasets. The IMDB dataset [3] consists of 50000 reviews of average length 239 words per review, associated to either a positive or a negative sentiment. Yelp reviews [4] provide a much larger selection. [3] https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews [4] https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset
Dataset Splits No The paper describes data preprocessing steps such as subsampling the Yelp dataset based on character length and removing neutral reviews, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or exact counts) needed for reproduction.
Hardware Specification No Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [No] Justification: Our experiments are on shallow attention layers, and they do not require a large amount of computational resources.
Software Dependencies No For all numerical simulations, we use the AdamW optimizer from torch.optim, and we reduce the learning rate in a multiplicative fashion by a factor γ = 0.1 at epochs 100 and 200, i.e., LR_new = LR_old · γ. (This mentions 'torch.optim' but does not provide a specific version number for PyTorch or any other software dependency.)
Experiment Setup Yes For all numerical simulations, we use the AdamW optimizer from torch.optim, and we reduce the learning rate in a multiplicative fashion by a factor γ = 0.1 at epochs 100 and 200, i.e., LR_new = LR_old · γ. We adhere to the batch size of 128 and fix the embedding dimension to 2048. IMDB and Yelp datasets. The hyperparameters do not differ between the two-layer model and the one-layer model. We set the number of training epochs to 500, the learning rate to 0.01, and the weight decay to 10^{-8}. Synthetic data. We set the number of training epochs to 196, the learning rate to 10^{-4}, and the weight decay to 10^{-4}.
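
The step schedule quoted in the Experiment Setup row can be sketched in plain Python to show which learning rate is in effect at each epoch. This is a minimal illustration, not the authors' code, which was not released. The names SETUPS, lr_at_epoch, and schedule are hypothetical. Two assumptions are made: epochs are indexed from 0, and the reduction takes effect at the start of epochs 100 and 200. That behaviour matches torch.optim.lr_scheduler.MultiStepLR, but the paper does not say which scheduler was used.

```python
from bisect import bisect_right

# Hyperparameters quoted in the Experiment Setup row; the dictionary keys
# and all names in this sketch are illustrative, not taken from the paper.
SETUPS = {
    "imdb_yelp": {"epochs": 500, "lr": 1e-2, "weight_decay": 1e-8},
    "synthetic": {"epochs": 196, "lr": 1e-4, "weight_decay": 1e-4},
}
MILESTONES = (100, 200)  # epochs at which the learning rate is reduced
GAMMA = 0.1              # multiplicative reduction factor
BATCH_SIZE = 128
EMBEDDING_DIM = 2048


def lr_at_epoch(epoch, base_lr, milestones=MILESTONES, gamma=GAMMA):
    """Learning rate in effect at a 0-indexed epoch (indexing is an assumption)."""
    # Count the milestones already reached, then apply gamma once per milestone.
    n_decays = bisect_right(milestones, epoch)
    return base_lr * gamma ** n_decays


def schedule(name):
    """Per-epoch learning rates for one of the two quoted setups."""
    cfg = SETUPS[name]
    return [lr_at_epoch(epoch, cfg["lr"]) for epoch in range(cfg["epochs"])]
```

Under these assumptions the synthetic-data runs, which stop at 196 epochs, only ever reach the first milestone, so the second reduction applies to the IMDB and Yelp runs alone.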