Elliptical Attention
Authors: Stefan Nielsen, Laziz Abdullaev, Rachel S.Y. Teo, Tan Nguyen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the advantages of Elliptical Attention over the baseline dot-product attention and state-of-the-art attention methods on various practical tasks, including object classification, image segmentation, and language modeling across different data modalities. |
| Researcher Affiliation | Collaboration | Stefan K. Nielsen, FPT Software AI Center, Ha Noi, Vietnam (stefannvkp@fpt.com); Laziz U. Abdullaev, Department of Mathematics, National University of Singapore, Singapore 119077, Singapore (laziz.abdullaev@u.nus.edu); Rachel S.Y. Teo, Department of Mathematics, National University of Singapore, Singapore 119077, Singapore (rachel.teo@u.nus.edu); Tan M. Nguyen, Department of Mathematics, National University of Singapore, Singapore 119077, Singapore (tanmn@nus.edu.sg) |
| Pseudocode | Yes | Pseudocode for the Elliptical Attention computation is provided in Appendix F.12 (a hedged sketch of the computation appears after this table). |
| Open Source Code | Yes | The code is publicly available at https://github.com/stefvk/Elliptical-Attention. |
| Open Datasets | Yes | We pretrain and evaluate our models on the WikiText-103 benchmark in comparison with the standard baseline Transformer [82], Performer [9], Transformer-MGK [52], FourierFormer [54], and the robust kernel density estimation-based Transformers including Transformer-SPKDE and Transformer-MoM [23]. |
| Dataset Splits | Yes | The validation set and test sets consist of 60 articles with 218K and 246K tokens respectively. |
| Hardware Specification | Yes | All models are trained and evaluated on two NVIDIA A100 SXM4 40GB GPUs. |
| Software Dependencies | No | The paper mentions 'default PyTorch settings' but does not specify version numbers for PyTorch or any other software libraries or dependencies. |
| Experiment Setup | Yes | We trained with Adam using a starting learning rate of 0.00025 and cosine scheduling under default PyTorch settings. We used a batch size of 96 and trained for 120 epochs and 2000 warmup steps. The train and evaluation target lengths were set to 256. (A hedged sketch of this configuration appears after the table.) |
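
Elliptical Attention replaces the isotropic query–key comparison of dot-product attention with a Mahalanobis metric. The following is a minimal PyTorch sketch under that reading: the function name `elliptical_attention`, the temperature `2 * sqrt(d)`, and the externally supplied diagonal weight vector `m` are illustrative assumptions, not the paper's method; the authors' actual pseudocode and their estimator for the coordinate weights are in Appendix F.12 and the linked repository.

```python
import torch
import torch.nn.functional as F

def elliptical_attention(q, k, v, m, eps=1e-6):
    """Attention with a diagonal Mahalanobis (elliptical) metric.

    q, k, v : (batch, heads, seq, d) query/key/value tensors.
    m       : (d,) nonnegative coordinate weights defining M = diag(m);
              m = ones recovers a squared-distance variant of
              standard attention.
    """
    d = q.size(-1)
    # Normalize the weights so the metric's scale is comparable
    # across choices of m (an assumption, not from the paper).
    m = m / (m.mean() + eps)
    # Pairwise squared Mahalanobis distance ||q_i - k_j||_M^2,
    # shape (batch, heads, seq_q, seq_k).
    diff = q.unsqueeze(-2) - k.unsqueeze(-3)   # (..., seq_q, seq_k, d)
    dist2 = (diff.pow(2) * m).sum(-1)
    # Gaussian-kernel scores; softmax over the key axis.
    attn = F.softmax(-dist2 / (2 * d ** 0.5), dim=-1)
    return torch.matmul(attn, v)

# Usage: identical shapes to standard multi-head attention.
q = k = v = torch.randn(2, 4, 16, 32)
m = torch.rand(32)  # stand-in for the paper's learned coordinate weights
out = elliptical_attention(q, k, v, m)  # (2, 4, 16, 32)
```

Note that this sketch materializes the full pairwise difference tensor, which costs O(seq² · d) memory; it is meant to show the metric, not to be an efficient implementation.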
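The reported optimizer settings translate directly into PyTorch. The sketch below assumes a linear warmup followed by cosine decay to zero and uses a placeholder total step count; the paper specifies Adam, a 0.00025 starting learning rate, cosine scheduling, 2000 warmup steps, batch size 96, 120 epochs, and length 256, but not the exact warmup shape, so those details are assumptions.

```python
import math
import torch

# Stand-in model; only the optimizer and schedule mirror the reported
# setup. total_steps is a placeholder the reader must derive from
# their dataset size (batch size 96, 120 epochs, target length 256).
model = torch.nn.Linear(256, 256)
total_steps = 100_000   # placeholder: steps_per_epoch * 120
warmup_steps = 2000

optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)

def lr_lambda(step):
    # Linear warmup over the first 2000 steps (assumed shape),
    # then cosine decay toward zero for the remaining steps.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Training loop: call optimizer.step() then scheduler.step() once per batch.
```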