Luna: Linear Unified Nested Attention
Authors: Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, Luke Zettlemoyer
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of strong baseline methods including the full-rank attention and other efficient sparse and dense attention methods. |
| Researcher Affiliation | Collaboration | Xuezhe Ma (ISI, USC) xuezhema@isi.edu; Xiang Kong (LTI, CMU) xiangk@cs.cmu.edu; Sinong Wang (Facebook AI) sinongwang@fb.com; Chunting Zhou (LTI, CMU) chuntinz@cs.cmu.edu; Jonathan May (ISI, USC) jonmay@isi.edu; Hao Ma, Luke Zettlemoyer (Facebook AI) {haom, lsz}@fb.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The implementation of our model is available at https://github.com/XuezheMax/fairseq-apollo. |
| Open Datasets | Yes | We evaluate the effectiveness and efficiency of Luna on the Long Range Arena (LRA) benchmark recently introduced by Tay et al. (2021), which is designed for the purpose of evaluating efficient Transformer models under the long-context scenario. This benchmark collects five tasks: ListOps (Nangia and Bowman, 2018), byte-level text classification (Text; Maas et al., 2011), byte-level document retrieval (Retrieval; Radev et al., 2013), image classification on sequences of pixels (Image; Krizhevsky et al., 2009) and Pathfinder (Linsley et al., 2018). |
| Dataset Splits | Yes | For all tasks except for the task Retrieval, we closely follow the model configurations in Tay et al. (2021) such as data preprocessing, data split, model architecture, etc. Finetuning is performed for 20 epochs with early stopping based on each task's evaluation metric on the dev set. |
| Hardware Specification | Yes | For each experiment, we conduct distributed training across eight NVIDIA Tesla V100 GPUs with maximum batch size of 8192 tokens per GPU. |
| Software Dependencies | No | The paper mentions software like FairSeq and optimizers like Adam and Apollo but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | The Luna models closely follow the architecture of Transformer-base: 6 encoder and decoder layers with 8 attention heads and d_model/d_hidden = 512/2048. We train the Transformer-base model with two optimization methods: Adam (Kingma and Ba, 2015) and Apollo (Ma, 2020), and find Apollo achieves better performance. Therefore, we use Apollo as the optimizer for all Luna models. For each experiment, we conduct distributed training across eight NVIDIA Tesla V100 GPUs with maximum batch size of 8192 tokens per GPU. |
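
Since the paper provides no pseudocode (see the Pseudocode row above), the following is a minimal, hedged sketch of the pack-and-unpack "nested" attention that gives Luna its linear complexity, using the Transformer-base dimensions quoted in the Experiment Setup row (d_model = 512, 8 heads). Class, module, and variable names here are illustrative assumptions, not the authors' code; the reference implementation is in the linked fairseq-apollo repository.

```python
# Hedged sketch of Luna-style nested attention (not the authors' implementation).
# The learned extra sequence P of fixed length l first "packs" the input X,
# then X "unpacks" the result, so both attention calls cost O(l * n).
import torch
import torch.nn as nn

class LunaAttentionSketch(nn.Module):
    def __init__(self, d_model=512, num_heads=8, proj_len=16):
        super().__init__()
        # Learned projected sequence P, shared across examples (assumed initialization).
        self.p = nn.Parameter(torch.randn(proj_len, d_model) * d_model ** -0.5)
        # Two standard multi-head attention blocks implement the nesting.
        self.pack_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.unpack_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, n, d_model) with sequence length n.
        batch = x.size(0)
        p = self.p.unsqueeze(0).expand(batch, -1, -1)   # (batch, l, d_model)
        # Pack attention: P attends to X, compressing n tokens into l slots.
        y_p, _ = self.pack_attn(p, x, x)                # (batch, l, d_model)
        # Unpack attention: X attends to the packed context of length l.
        y_x, _ = self.unpack_attn(x, y_p, y_p)          # (batch, n, d_model)
        return y_x, y_p                                 # both are carried to the next layer


# Usage example with an LRA-scale sequence length.
x = torch.randn(2, 1000, 512)
y_x, y_p = LunaAttentionSketch()(x)   # y_x: (2, 1000, 512), y_p: (2, 16, 512)
```

For a fixed projected length l, each attention call scales as O(l·n) rather than O(n²) in the input length n, which is the source of the efficiency gains evaluated on the LRA tasks listed above.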