Elliptical Attention

Authors: Stefan Nielsen, Laziz Abdullaev, Rachel S.Y. Teo, Tan Nguyen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the advantages of Elliptical Attention over the baseline dot-product attention and state-of-the-art attention methods on various practical tasks, including object classification, image segmentation, and language modeling across different data modalities.
Researcher Affiliation | Collaboration | Stefan K. Nielsen, FPT Software AI Center, Ha Noi, Vietnam, stefannvkp@fpt.com; Laziz U. Abdullaev, Department of Mathematics, National University of Singapore, Singapore 119077, Singapore, laziz.abdullaev@u.nus.edu; Rachel S.Y. Teo, Department of Mathematics, National University of Singapore, Singapore 119077, Singapore, rachel.teo@u.nus.edu; Tan M. Nguyen, Department of Mathematics, National University of Singapore, Singapore 119077, Singapore, tanmn@nus.edu.sg
Pseudocode | Yes | Pseudocode for the Elliptical Attention computation is provided in Appendix F.12.
Open Source Code | Yes | The code is publicly available at https://github.com/stefvk/Elliptical-Attention.
Open Datasets | Yes | We pretrain and evaluate our models on the WikiText-103 benchmark in comparison with the standard baseline Transformer [82], Performer [9], Transformer-MGK [52], FourierFormer [54], and the robust kernel density estimation-based Transformers, including Transformer-SPKDE and Transformer-MoM [23].
Dataset Splits | Yes | The validation and test sets consist of 60 articles with 218K and 246K tokens, respectively.
Hardware Specification | Yes | All models are trained and evaluated on two NVIDIA A100 SXM4 40GB GPUs.
Software Dependencies | No | The paper mentions 'default PyTorch settings' but does not specify version numbers for PyTorch or any other software libraries or dependencies.
Experiment Setup | Yes | We trained with Adam using a starting learning rate of 0.00025 and cosine scheduling under default PyTorch settings. We used a batch size of 96 and trained for 120 epochs with 2000 warmup steps. The train and evaluation target lengths were set to 256.
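
The paper's actual pseudocode lives in Appendix F.12 and the repository linked in the table above; it is not reproduced here. Purely as an illustrative sketch, the snippet below implements attention whose similarity is a Gaussian kernel of a diagonal Mahalanobis distance rather than the standard scaled dot product, which is the kind of elliptical (axis-wise stretched) neighborhood the paper describes. The function name `mahalanobis_attention`, the temperature choice, and taking the diagonal metric `m` as an input (the paper instead estimates its metric from the data; see Appendix F.12) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mahalanobis_attention(q, k, v, m=None):
    """Attention with a Gaussian kernel of a diagonal Mahalanobis distance.

    q, k, v : (batch, heads, seq_len, head_dim) tensors.
    m       : (head_dim,) non-negative diagonal of the metric; if None, the
              identity is used and this reduces to an isotropic
              (squared-Euclidean) kernel.
    """
    d = q.size(-1)
    if m is None:
        m = q.new_ones(d)
    qm = q * m                                      # query coords scaled by the metric
    q_sq = (qm * q).sum(-1, keepdim=True)           # q^T M q  -> (B, H, N, 1)
    k_sq = ((k * m) * k).sum(-1).unsqueeze(-2)      # k^T M k  -> (B, H, 1, N)
    cross = qm @ k.transpose(-2, -1)                # q^T M k  -> (B, H, N, N)
    dist_sq = q_sq + k_sq - 2.0 * cross             # ||q_i - k_j||_M^2
    attn = F.softmax(-dist_sq / (2.0 * d ** 0.5), dim=-1)  # temperature is an assumption
    return attn @ v

# Toy usage with a random diagonal metric, for shape-checking only.
q = k = v = torch.randn(2, 8, 256, 64)
out = mahalanobis_attention(q, k, v, m=torch.rand(64))
```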
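
The optimizer and schedule reported in the Experiment Setup row translate fairly directly into PyTorch. The sketch below is a minimal, hypothetical reconstruction using only the stated hyperparameters (Adam, learning rate 0.00025, cosine scheduling, 2000 warmup steps, 120 epochs); the placeholder model, the steps-per-epoch value, and the particular warmup-plus-cosine composition are assumptions, since the paper only says "cosine scheduling under default PyTorch settings".

```python
import torch

model = torch.nn.Linear(256, 256)   # placeholder; the real model is the Transformer under study
steps_per_epoch = 1_000             # assumption: depends on corpus size, batch size 96, length 256
epochs, warmup_steps = 120, 2_000
total_steps = epochs * steps_per_epoch

optimizer = torch.optim.Adam(model.parameters(), lr=0.00025)   # default betas/eps

# Linear warmup for 2000 steps, then cosine decay over the remaining steps.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, end_factor=1.0, total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# scheduler.step() would then be called once per optimization step during training.
```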