Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Selective Attention: Enhancing Transformer through Principled Context Control
Authors: Xuechen Zhang, Xiangyu Chang, Mingchen Li, Amit Roy-Chowdhury, Jiasi Chen, Samet Oymak
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through theory and experiments, we demonstrate that this alleviates attention dilution, aids the optimization process, and enhances the model's ability to control softmax spikiness of individual queries. |
| Researcher Affiliation | Academia | Xuechen Zhang (University of Michigan); Xiangyu Chang (University of California, Riverside); Mingchen Li (University of Michigan); Amit Roy-Chowdhury (University of California, Riverside); Jiasi Chen (University of Michigan); Samet Oymak (University of Michigan) |
| Pseudocode | No | The paper describes mathematical definitions and processes but does not include structured pseudocode or algorithm blocks explicitly labeled as such. |
| Open Source Code | Yes | The GitHub repo containing SSA implementation is provided in https://github.com/umich-sota/selective_attention. We release our training and evaluation code in a zip file. |
| Open Datasets | Yes | For the pre-training evaluation, we train the model from scratch on the SlimPajama dataset [41]...We employ GPT-2...pre-trained on the WebText dataset [34]...For Pythia...pre-trained on the Pile dataset [17]...Llama...fine-tune using the official pre-trained model. |
| Dataset Splits | No | The paper describes pre-training on one dataset and evaluating on others, and fine-tuning on downstream tasks, but it does not provide specific train/validation/test split percentages or counts for reproducibility of data partitioning. |
| Hardware Specification | Yes | The pre-training takes about 2 hours using 4 A40 and fine-tuning takes about 2 days. All the experiments are conducted with 4 or 8 A40 or L40S. |
| Software Dependencies | No | The paper mentions software like "FlashAttention [11]" and "FlashAttention [12]" but does not specify version numbers for these or other software libraries required for replication. |
| Experiment Setup | Yes | We set the learning rate at 1e-4. As the training configuration, we train with 3.5 million tokens for fine-tuning and 15B tokens for pre-training. We always use the AdamW optimizer [24], β1 = 0.9 and β2 = 0.95. We set learning rate 1e-6 with no weight decay and no warmup. |