Linear Log-Normal Attention with Unbiased Concentration

Authors: Yury Nahshan, Joseph Kampeas, Emir Haleva

ICLR 2024

Reproducibility assessment (variable: result, followed by the LLM's supporting response):
Research Type: Experimental
"Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives, offering a promising avenue for enhancing the scalability of transformer models."
Researcher Affiliation: Industry
"Yury Nahshan, Joseph Kampeas and Emir Haleva, Distributed and Parallel Software Lab, Huawei Technologies. Email: {first.last}@huawei.com"
Pseudocode: No
The paper does not contain any clearly labeled pseudocode or algorithm blocks. (A hedged sketch of the underlying linearized-attention computation is given after this list.)
Open Source Code: Yes
"We have made the code of our method available for MindSpore and PyTorch frameworks." gitee.com/ynahshan/linear-lognormal-attention-ms and github.com/ynahshan/linear-lognormal-attention
Open Datasets: Yes
"We first pre-train the bidirectional RoBERTa encoder model (Liu et al., 2019) using LLN Attention on the WikiText-103 corpus (Merity et al., 2018). Next, to evaluate the performance of LLN Attention on downstream tasks, we fine-tune our pretrained model on several NLP tasks from the General Language Understanding Evaluation (GLUE) dataset (Wang et al., 2018)." "We train this model for 100 epochs on the Dogs vs. Cats dataset with LLN and Softmax Attention." (https://www.kaggle.com/competitions/dogs-vs-cats-redux-kernels-edition/data) "We use the Long Range Arena (LRA) (Tay et al., 2020c) benchmark to evaluate LLN Attention on longer sequences." (A loading sketch for the public datasets appears after this list.)
Dataset Splits: No
The paper mentions "training and validation loss" in Appendix A.8.1 and Figure 8, but it gives no concrete split details (e.g., percentages or sample counts) used for validation, and it does not state that the standard benchmark splits were used.
Hardware Specification: No
The paper states only that it "performed all measurements on a commodity GPU," which is too vague: no GPU model, GPU count, CPU type, or memory size is reported. (A sketch for self-reporting the environment follows this list.)
Software Dependencies: No
The paper mentions using the "Fairseq framework (Ott et al., 2019)", the vit-pytorch code base, and the Skyformer (Chen et al., 2021) code base, but it does not specify version numbers for any of these software components.
Experiment Setup: Yes
"For all our experiments, we use the Fairseq framework (Ott et al., 2019) with the default configuration and hyperparameters of the RoBERTa-base model. We perform the training with FP16 precision." "In the ablation study, we train the model with various fixed values of hyper-parameters α and β." (A hedged sketch of such a Fairseq invocation closes this section.)
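
Because the paper ships no pseudocode, the following is a minimal PyTorch sketch of the generic kernelized linear attention that linearized methods such as LLN Attention build on. The exponential feature maps and the use of the paper's α and β as their temperatures are assumptions made here for illustration; the authors' released repositories above are the authoritative implementation.

```python
import torch

def linear_attention(q, k, v, alpha=1.0, beta=1.0, eps=1e-6):
    """Generic linearized attention, O(n) in sequence length.

    q, k, v: tensors of shape (batch, seq_len, dim).
    alpha, beta: feature-map temperatures. Exponential maps driven by
    the paper's alpha/beta hyperparameters are an ASSUMED stand-in for
    the LLN construction, not a verified reimplementation.
    """
    # Positive feature maps applied to queries and keys.
    q_prime = torch.exp(alpha * q)                       # (b, n, d)
    k_prime = torch.exp(beta * k)                        # (b, n, d)

    # Associativity trick: compute (K'^T V) once, then multiply by Q'.
    # This is what removes the O(n^2) attention matrix.
    kv = torch.einsum("bnd,bne->bde", k_prime, v)        # (b, d, e)
    num = torch.einsum("bnd,bde->bne", q_prime, kv)      # (b, n, e)

    # Normalizer Q'(K'^T 1), playing the role of the softmax denominator.
    z = torch.einsum("bnd,bd->bn", q_prime, k_prime.sum(dim=1))
    return num / (z.unsqueeze(-1) + eps)
```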
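
WikiText-103 and GLUE, cited in the Open Datasets row, are standard public benchmarks. One hypothetical way to pull them is with the Hugging Face datasets library; the paper itself goes through Fairseq's preprocessing pipeline, so this is a convenience sketch, not the authors' setup.

```python
from datasets import load_dataset

# WikiText-103 (Merity et al., 2018): the pre-training corpus.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")

# A GLUE task (Wang et al., 2018) for fine-tuning; SST-2 is chosen
# here arbitrarily -- the paper fine-tunes on several GLUE tasks.
sst2 = load_dataset("glue", "sst2")

print(wikitext)            # standard train/validation/test splits
print(sst2["train"][0])    # one labeled sentence
```

The Dogs vs. Cats data must be fetched from the Kaggle competition page linked above, and LRA from its original release (Tay et al., 2020c).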
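
Since neither hardware nor dependency versions are reported, anyone attempting a reproduction should at least log their own environment. A minimal sketch:

```python
import platform
import torch

# Record the environment details the paper omits, so that a
# reproduction run is self-documenting.
print("python :", platform.python_version())
print("torch  :", torch.__version__)
if torch.cuda.is_available():
    print("gpu    :", torch.cuda.get_device_name(0))
    print("n_gpus :", torch.cuda.device_count())
```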
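
Finally, the Experiment Setup row points to Fairseq's stock RoBERTa-base recipe with FP16 enabled. Below is a sketch of what such a run could look like; every flag except --arch roberta_base and --fp16 is an assumption borrowed from Fairseq's public RoBERTa pretraining tutorial, and the data-bin path is hypothetical.

```python
import subprocess

# ASSUMED recipe: the paper only says "default configuration and
# hyperparameters of the RoBERTa-base model" trained in FP16.
subprocess.run([
    "fairseq-train", "data-bin/wikitext-103",      # hypothetical path
    "--task", "masked_lm", "--criterion", "masked_lm",
    "--arch", "roberta_base",
    "--sample-break-mode", "complete", "--tokens-per-sample", "512",
    "--optimizer", "adam", "--adam-betas", "(0.9, 0.98)",
    "--lr-scheduler", "polynomial_decay", "--lr", "0.0005",
    "--warmup-updates", "10000", "--max-update", "125000",
    "--dropout", "0.1", "--weight-decay", "0.01",
    "--fp16",                                      # FP16 training, per the paper
], check=True)
```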