Selective Attention: Enhancing Transformer through Principled Context Control
Authors: Xuechen Zhang, Xiangyu Chang, Mingchen Li, Amit Roy-Chowdhury, Jiasi Chen, Samet Oymak
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through theory and experiments, we demonstrate that this alleviates attention dilution, aids the optimization process, and enhances the model's ability to control softmax spikiness of individual queries. |
| Researcher Affiliation | Academia | Xuechen Zhang, University of Michigan, zxuechen@umich.edu; Xiangyu Chang, University of California, Riverside, cxian008@ucr.edu; Mingchen Li, University of Michigan, milii@umich.edu; Amit Roy-Chowdhury, University of California, Riverside, amitrc@ece.ucr.edu; Jiasi Chen, University of Michigan, jiasi@umich.edu; Samet Oymak, University of Michigan, oymak@umich.edu |
| Pseudocode | No | The paper describes mathematical definitions and processes but does not include structured pseudocode or algorithm blocks explicitly labeled as such. |
| Open Source Code | Yes | The GitHub repo containing the SSA implementation is provided at https://github.com/umich-sota/selective_attention. We release our training and evaluation code in a zip file. |
| Open Datasets | Yes | For the pre-training evaluation, we train the model from scratch on the SlimPajama dataset [41]...We employ GPT-2...pre-trained on the WebText dataset [34]...For Pythia...pre-trained on the Pile dataset [17]...Llama...fine-tune using the official pre-trained model. |
| Dataset Splits | No | The paper describes pre-training on one dataset and evaluating on others, and fine-tuning on downstream tasks, but it does not provide specific train/validation/test split percentages or counts for reproducibility of data partitioning. |
| Hardware Specification | Yes | The pre-training takes about 2 hours using 4 A40 GPUs and fine-tuning takes about 2 days. All the experiments are conducted with 4 or 8 A40 or L40S GPUs. |
| Software Dependencies | No | The paper mentions software like "Flash Attention [11]" and "Flash Attention [12]" but does not specify version numbers for these or other software libraries required for replication. |
| Experiment Setup | Yes | We set the learning rate at 1e-4. As the training configuration, we train with 3.5 million tokens for fine-tuning and 15B tokens for pre-training. We always use the AdamW optimizer [24], β1 = 0.9 and β2 = 0.95. We set the learning rate to 1e-6 with no weight decay and no warmup. |
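
The optimizer settings quoted above can be expressed as a minimal sketch, assuming a PyTorch training setup. The `model` variable is a placeholder, and the mapping of the two reported learning rates (1e-4 for one configuration, 1e-6 with no weight decay and no warmup for the other) to specific runs is an assumption, not taken from the authors' released code.

```python
# Minimal sketch of the reported optimizer configuration (assumed PyTorch;
# the model and the lr-to-run mapping are placeholders, not the released code).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)  # stand-in for the transformer being trained

# Reported settings: AdamW with beta1 = 0.9, beta2 = 0.95.
optimizer = AdamW(
    model.parameters(),
    lr=1e-4,           # 1e-6 is reported for the no-weight-decay configuration
    betas=(0.9, 0.95),
    weight_decay=0.0,  # "no weight decay" per the reported setup
)

# "No warmup": keep the learning rate constant from step 0.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: 1.0)
```

This sketch only covers the optimizer hyperparameters quoted in the table; the token budgets (3.5M tokens for fine-tuning, 15B for pre-training) and the SSA architecture itself are defined in the linked repository.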