Linear Log-Normal Attention with Unbiased Concentration
Authors: Yury Nahshan, Joseph Kampeas, Emir Haleva
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives, offering a promising avenue for enhancing the scalability of transformer models. |
| Researcher Affiliation | Industry | Yury Nahshan, Joseph Kampeas and Emir Haleva, Distributed and Parallel Software Lab, Huawei Technologies. Email: {first.last}@huawei.com |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We have made the code of our method available for the MindSpore and PyTorch frameworks: gitee.com/ynahshan/linear-lognormal-attention-ms and github.com/ynahshan/linear-lognormal-attention |
| Open Datasets | Yes | We first pre-train the bidirectional RoBERTa encoder model (Liu et al., 2019) using LLN Attention on the WikiText-103 corpus (Merity et al., 2018). Next, to evaluate the performance of LLN Attention on downstream tasks, we fine-tune our pretrained model on several NLP tasks from the General Language Understanding Evaluation (GLUE) dataset (Wang et al., 2018). We train this model for 100 epochs on the Dogs vs. Cats dataset (https://www.kaggle.com/competitions/dogs-vs-cats-redux-kernels-edition/data) with LLN and Softmax Attention. We use the Long Range Arena (LRA) (Tay et al., 2020c) benchmark to evaluate LLN Attention on longer sequences. |
| Dataset Splits | No | The paper mentions "training and validation loss" in Appendix A.8.1 and Figure 8, but it neither provides specific dataset-split details (e.g., percentages or sample counts) used for validation nor states that standard benchmark splits were used. |
| Hardware Specification | No | The paper states: "performed all measurements on a commodity GPU." This is too vague and does not provide specific hardware details (e.g., specific GPU model, number of GPUs, CPU type, or memory). |
| Software Dependencies | No | The paper mentions using the "Fairseq framework (Ott et al., 2019)", the "vit-pytorch code base", and the "Skyformer (Chen et al., 2021) code base", but it does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | For all our experiments, we use the Fairseq framework (Ott et al., 2019) with the default configuration and hyperparameters of the RoBERTa-base model. We perform the training with FP16 precision. In the ablation study, we train the model with various fixed values of the hyperparameters α and β. |
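For context on what is being reproduced: the method is a kernelized (linear) attention whose feature maps are governed by the hyperparameters α and β mentioned in the experiment setup above. The following is a minimal PyTorch sketch of that style of attention, not the authors' released implementation (see the linked repositories for that); the element-wise exponential feature maps, the stabilizing max-shifts, and all names here are our assumptions.

```python
import torch

def lln_style_linear_attention(q, k, v, alpha=1.0, beta=1.0, eps=1e-6):
    """q, k, v: (batch, seq_len, dim) tensors; returns (batch, seq_len, dim).

    Keys and values are contracted first, so the cost is
    O(seq_len * dim^2) instead of softmax attention's O(seq_len^2 * dim).
    """
    # Max-shifts for numerical stability: the per-query shift rescales each
    # query row by a scalar that cancels in the normalization below, and the
    # single global shift over all keys cancels the same way.
    phi_q = torch.exp(alpha * (q - q.amax(dim=-1, keepdim=True)))
    phi_k = torch.exp(beta * (k - k.amax(dim=(-2, -1), keepdim=True)))

    # Sequence summary: phi(K)^T V, shape (batch, dim, dim).
    kv = torch.einsum('bnd,bne->bde', phi_k, v)
    # Normalizer: phi(Q) (phi(K)^T 1), shape (batch, seq_len).
    z = torch.einsum('bnd,bd->bn', phi_q, phi_k.sum(dim=1))
    # Numerator phi(Q) (phi(K)^T V), normalized per query position.
    return torch.einsum('bnd,bde->bne', phi_q, kv) / (z.unsqueeze(-1) + eps)

# Toy usage: batch of 2, sequence length 128, head dimension 64.
q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
out = lln_style_linear_attention(q, k, v, alpha=0.8, beta=0.8)
print(out.shape)  # torch.Size([2, 128, 64])
```

Contracting phi(K) with V before touching the queries is what makes the memory and compute linear in sequence length, which is the scalability claim the paper's LRA experiments test.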
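On the training side, the paper defers to Fairseq's default RoBERTa-base configuration with FP16. A reproduction attempt would therefore start from something like Fairseq's published masked-LM pretraining recipe, sketched below; the data path and every hyperparameter value come from that public recipe, not from the paper itself.

```bash
# Sketch of Fairseq RoBERTa-base pretraining on WikiText-103 (values from
# Fairseq's example recipe; the paper only says "default configuration").
DATA_DIR=data-bin/wikitext-103

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample 512 \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr 0.0005 --warmup-updates 10000 \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --batch-size 16 --update-freq 16 --max-update 125000 \
    --log-format simple --log-interval 1
```

Swapping softmax attention for LLN Attention inside this setup, and fixing α and β for the ablation runs, would be done in the model code rather than via these flags.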