Luna: Linear Unified Nested Attention
Authors: Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, Luke Zettlemoyer
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of strong baseline methods including the full-rank attention and other efficient sparse and dense attention methods. |
| Researcher Affiliation | Collaboration | Xuezhe Ma (ISI, USC) xuezhema@isi.edu; Xiang Kong (LTI, CMU) xiangk@cs.cmu.edu; Sinong Wang (Facebook AI) sinongwang@fb.com; Chunting Zhou (LTI, CMU) chuntinz@cs.cmu.edu; Jonathan May (ISI, USC) jonmay@isi.edu; Hao Ma, Luke Zettlemoyer (Facebook AI) {haom, lsz}@fb.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The implementation of our model is available at https://github.com/XuezheMax/fairseq-apollo. |
| Open Datasets | Yes | We evaluate the effectiveness and efficiency of Luna on the Long Range Arena (LRA) benchmark recently introduced by Tay et al. (2021), which is designed for the purpose of evaluating efficient Transformer models under the long-context scenario. This benchmark collects five tasks: ListOps (Nangia and Bowman, 2018), byte-level text classification (Text; Maas et al., 2011), byte-level document retrieval (Retrieval; Radev et al., 2013), image classification on sequences of pixels (Image; Krizhevsky et al., 2009) and Pathfinder (Linsley et al., 2018). |
| Dataset Splits | Yes | For all tasks except for the task Retrieval, we closely follow the model configurations in Tay et al. (2021) such as data preprocessing, data split, model architecture, etc. Finetuning is performed for 20 epochs with early stopping based on each task's evaluation metric on the dev set. |
| Hardware Specification | Yes | For each experiment, we conduct distributed training across eight NVIDIA Tesla V100 GPUs with maximum batch size of 8192 tokens per GPU. |
| Software Dependencies | No | The paper mentions software like FairSeq and optimizers like Adam and Apollo but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | The Luna models closely follow the architecture of Transformer-base: 6 encoder and decoder layers with 8 attention heads and d_model/d_hidden = 512/2048. We train the Transformer-base model with two optimization methods: Adam (Kingma and Ba, 2015) and Apollo (Ma, 2020), and find Apollo achieves better performance. Therefore, we use Apollo as the optimizer for all Luna models. For each experiment, we conduct distributed training across eight NVIDIA Tesla V100 GPUs with maximum batch size of 8192 tokens per GPU. |
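
Since the paper provides no pseudocode (see the Pseudocode row above), the following is a minimal, hedged sketch of the pack-and-unpack "nested" attention that gives Luna its linear complexity, using the Transformer-base dimensions quoted in the Experiment Setup row (d_model = 512, 8 heads). Class, module, and variable names here are illustrative assumptions, not the authors' code; the reference implementation is in the linked fairseq-apollo repository.

```python
# Hedged sketch of Luna-style nested attention (not the authors' implementation).
# The learned extra sequence P of fixed length l first "packs" the input X,
# then X "unpacks" the result, so both attention calls cost O(l * n).
import torch
import torch.nn as nn

class LunaAttentionSketch(nn.Module):
    def __init__(self, d_model=512, num_heads=8, proj_len=16):
        super().__init__()
        # Learned projected sequence P, shared across examples (assumed initialization).
        self.p = nn.Parameter(torch.randn(proj_len, d_model) * d_model ** -0.5)
        # Two standard multi-head attention blocks implement the nesting.
        self.pack_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.unpack_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, n, d_model) with sequence length n.
        batch = x.size(0)
        p = self.p.unsqueeze(0).expand(batch, -1, -1)   # (batch, l, d_model)
        # Pack attention: P attends to X, compressing n tokens into l slots.
        y_p, _ = self.pack_attn(p, x, x)                # (batch, l, d_model)
        # Unpack attention: X attends to the packed context of length l.
        y_x, _ = self.unpack_attn(x, y_p, y_p)          # (batch, n, d_model)
        return y_x, y_p                                 # both are carried to the next layer


# Usage example with an LRA-scale sequence length.
x = torch.randn(2, 1000, 512)
y_x, y_p = LunaAttentionSketch()(x)   # y_x: (2, 1000, 512), y_p: (2, 16, 512)
```

For a fixed projected length l, each attention call scales as O(l·n) rather than O(n²) in the input length n, which is the source of the efficiency gains evaluated on the LRA tasks listed above.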