Selective Attention: Enhancing Transformer through Principled Context Control

Authors: Xuechen Zhang, Xiangyu Chang, Mingchen Li, Amit Roy-Chowdhury, Jiasi Chen, Samet Oymak

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through theory and experiments, we demonstrate that this alleviates attention dilution, aids the optimization process, and enhances the model's ability to control softmax spikiness of individual queries. (A hedged sketch of query-dependent temperature scaling in this spirit is given after the table.)
Researcher Affiliation | Academia | Xuechen Zhang, University of Michigan (zxuechen@umich.edu); Xiangyu Chang, University of California, Riverside (cxian008@ucr.edu); Mingchen Li, University of Michigan (milii@umich.edu); Amit Roy-Chowdhury, University of California, Riverside (amitrc@ece.ucr.edu); Jiasi Chen, University of Michigan (jiasi@umich.edu); Samet Oymak, University of Michigan (oymak@umich.edu)
Pseudocode | No | The paper describes mathematical definitions and processes but does not include structured pseudocode or algorithm blocks explicitly labeled as such.
Open Source Code | Yes | The GitHub repo containing the SSA implementation is provided at https://github.com/umich-sota/selective_attention. We release our training and evaluation code in a zip file.
Open Datasets | Yes | For the pre-training evaluation, we train the model from scratch on the SlimPajama dataset [41]... We employ GPT-2... pre-trained on the WebText dataset [34]... For Pythia... pre-trained on the Pile dataset [17]... Llama... fine-tune using the official pre-trained model.
Dataset Splits | No | The paper describes pre-training on one dataset and evaluating on others, and fine-tuning on downstream tasks, but it does not provide specific train/validation/test split percentages or counts for reproducibility of data partitioning.
Hardware Specification | Yes | The pre-training takes about 2 hours using 4 A40 GPUs and fine-tuning takes about 2 days. All the experiments are conducted with 4 or 8 A40 or L40S GPUs.
Software Dependencies | No | The paper mentions software like "Flash Attention [11]" and "Flash Attention [12]" but does not specify version numbers for these or other software libraries required for replication.
Experiment Setup | Yes | We set the learning rate at 1e-4. As the training configuration, we train with 3.5 million tokens for fine-tuning and 15B tokens for pre-training. We always use the AdamW optimizer [24], with β1 = 0.9 and β2 = 0.95. We set the learning rate to 1e-6 with no weight decay and no warmup. (A hedged optimizer-setup sketch based on these values follows the table.)
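
The quoted claim about controlling "softmax spikiness of individual queries" suggests a query-dependent temperature applied to the attention logits before the softmax. The sketch below only illustrates that idea under that assumption: the module name SelectiveAttentionSketch, the temp_proj head, and the softplus parameterization are hypothetical and are not taken from the authors' released code at https://github.com/umich-sota/selective_attention.

# Minimal sketch of query-dependent temperature scaling before softmax.
# All names here (SelectiveAttentionSketch, temp_proj, tau) are hypothetical
# and NOT from the official repo; the point is only to show how each query
# can modulate how peaked its own attention distribution is.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveAttentionSketch(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Hypothetical per-query temperature head: one scalar per head per query.
        self.temp_proj = nn.Linear(dim, n_heads, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        # Query-dependent inverse temperature tau > 0: tau > 1 sharpens the
        # softmax (spikier attention), tau < 1 flattens it.
        tau = F.softplus(self.temp_proj(x)).transpose(1, 2).unsqueeze(-1)  # (B, H, T, 1)

        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        scores = scores * tau  # per-query spikiness control

        # Causal mask: each position attends only to itself and the past.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))

        attn = scores.softmax(dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.out(y)

With tau fixed at 1 this reduces to standard scaled dot-product attention, which makes clear that the extra head only modulates the spikiness of each query's softmax distribution rather than changing the attention mechanism itself.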
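
For the Experiment Setup row, the reported hyperparameters (AdamW with β1 = 0.9, β2 = 0.95, learning rates of 1e-4 and 1e-6, no weight decay, no warmup) pin down little more than the optimizer construction. The snippet below is a minimal sketch of just that piece; the model and data pipeline are placeholders, the token budgets (3.5M fine-tuning / 15B pre-training tokens) are not reproduced, and build_optimizer is a hypothetical helper rather than part of the released code.

# Minimal optimizer setup reflecting only the hyperparameters quoted above
# (AdamW, beta1 = 0.9, beta2 = 0.95, lr 1e-4 or 1e-6, no weight decay, no
# warmup). The model below is a stand-in, not the authors' transformer.
import torch

def build_optimizer(model: torch.nn.Module, lr: float = 1e-4) -> torch.optim.AdamW:
    """AdamW with the betas reported in the paper and no weight decay."""
    return torch.optim.AdamW(
        model.parameters(),
        lr=lr,                 # 1e-4 reported for one setting, 1e-6 for another
        betas=(0.9, 0.95),
        weight_decay=0.0,      # "no weight decay"
    )

if __name__ == "__main__":
    model = torch.nn.Linear(8, 8)          # placeholder for the transformer
    opt = build_optimizer(model, lr=1e-4)  # no LR warmup / scheduler is attached
    print(opt)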