Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Transformers Learn Faster with Semantic Focus

Authors: Parikshit Ram, Kenneth Clarkson, Tim Klinger, Shashanka Ubaru, Alexander G. Gray

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically studying a range of attention mechanisms, we find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models, while input-agnostic sparse attention models show no such benefits, a phenomenon that is robust across architectural and optimization hyperparameter choices. This can be interpreted as demonstrating that concentrating a model's semantic focus with respect to the tokens currently being considered (in the form of input-dependent sparse attention) accelerates learning. We develop a theoretical characterization of the conditions that explain this behavior.
Researcher Affiliation | Collaboration | Parikshit Ram¹, Kenneth Clarkson¹, Tim Klinger¹, Shashanka Ubaru¹, Alexander Gray²,³; ¹IBM Research, ²Centaur AI Institute, ³Purdue University
Pseudocode | No | The paper describes theoretical characterizations and empirical observations but does not contain any explicit pseudocode or algorithm blocks. The methods are described mathematically and textually in Section 4 and its appendices.
Open Source Code | Yes | The details for the empirical evaluation are provided in appendix D. We also provide the code with necessary documentation to reproduce the experimental results in this GitHub repository.
Open Datasets | Yes | We consider the List Operations or ListOps task [62] from the LRA benchmark [4]... From the NNCH benchmark [11]... We consider a preliminary experiment with the Penn Treebank [63] natural language dataset
Dataset Splits | Yes | For all the tasks, we utilize training / holdout sets of sizes 5000 / 2000.
Hardware Specification | Yes | All our empirical evaluations are performed on an Intel i7 Core CPU (16 threads, 64GB memory), and an Nvidia V100 GPU (8GB memory).
Software Dependencies | Yes | The implementation is in PyTorch 2.2 with CUDA 12.4.
Experiment Setup | Yes | For the NNCH tasks, we considered the transformer architecture used in Delétang et al. [11] with (i) T = 5 transformer blocks, (ii) embedding dimension d = 64 and (iii) the MLP hidden layer d_MLP = 64, but with a single head (instead of 8) and a dropout of 0.01. The final classification layer uses the average of all the token representations after the final transformer block. For the ListOps task, we utilize the same architecture but use T = 10 transformer blocks for the initial experiment. We also consider varying the number of heads and blocks in our experiments studying the effect of hyperparameters. For all problems, we use the SGD optimizer and the StepLR learning rate scheduler with a decay rate of 0.99 for ListOps and 0.9995 for NNCH tasks and a decay period of 1 epoch. For the NNCH tasks, we use an initial learning rate of 0.1, while we use 1.0 for ListOps. The number of epochs is selected to ensure that the standard full-attention transformer is able to consistently achieve 100% training accuracy (and thus, the ERM has converged). Thus, we use 100 epochs for Even Pairs, 200 epochs for ListOps and Stack Manipulation, 250 epochs for Missing Duplicates, 600 epochs for Modular Arithmetic and Solve Equation, 750 epochs for Cycle Navigation, and 1000 for Parity.
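As a rough illustration of the schedule quoted in the Experiment Setup row, the sketch below computes the learning rate that a step-decay schedule would reach at the end of training for each task. It is not the authors' code. It assumes the "decay rate" and "decay period" map onto the `gamma` and `step_size` arguments of PyTorch's `torch.optim.lr_scheduler.StepLR`, so that the rate after `e` epochs is `initial_lr * gamma ** (e // step_size)`. The names `EPOCHS`, `LISTOPS`, `NNCH`, `lr_at_epoch`, and `final_lr` are hypothetical; only the numeric values come from the row above.

```python
# Values quoted in the Experiment Setup row; all names here are illustrative.
LISTOPS = {"initial_lr": 1.0, "decay_rate": 0.99}
NNCH = {"initial_lr": 0.1, "decay_rate": 0.9995}

# Number of training epochs reported per task.
EPOCHS = {
    "Even Pairs": 100,
    "ListOps": 200,
    "Stack Manipulation": 200,
    "Missing Duplicates": 250,
    "Modular Arithmetic": 600,
    "Solve Equation": 600,
    "Cycle Navigation": 750,
    "Parity": 1000,
}


def lr_at_epoch(initial_lr, decay_rate, epoch, decay_period=1):
    """Learning rate after `epoch` completed epochs under step decay.

    Assumption: the rate is multiplied by `decay_rate` once every
    `decay_period` epochs, as in a StepLR-style schedule.
    """
    return initial_lr * decay_rate ** (epoch // decay_period)


def final_lr(task):
    """Learning rate at the end of training for the named task."""
    cfg = LISTOPS if task == "ListOps" else NNCH
    return lr_at_epoch(cfg["initial_lr"], cfg["decay_rate"], EPOCHS[task])
```

Under these assumptions, `final_lr("ListOps")` comes out at roughly 0.134 and `final_lr("Parity")` at roughly 0.061, so both schedules decay the initial rate by less than an order of magnitude over the reported training lengths.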