Mega: Moving Average Equipped Gated Attention
Authors: Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, Luke Zettlemoyer
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that MEGA achieves significant improvements over other sequence models, including variants of Transformers and recent state space models. |
| Researcher Affiliation | Collaboration | Xuezhe Ma (ISI, USC); Chunting Zhou (Meta AI); Xiang Kong (LTI, CMU); Junxian He (SJTU); Liangke Gui (LTI, CMU); Graham Neubig (LTI, CMU); Jonathan May (ISI, USC); Luke Zettlemoyer (Meta AI) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. (An illustrative, hedged sketch of the damped-EMA recurrence at the core of the method is given after this table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code release for the methodology described. |
| Open Datasets | Yes | To evaluate MEGA, we conduct experiments on five benchmark sequence modeling tasks across various data types. All the marked numbers indicate results from the baseline models replicated by us. More detailed descriptions, results and analysis are provided in Appendix D. ... Long Range Arena (LRA) benchmark recently introduced by Tay et al. (2021)... WMT 2016 English-German news translation (WMT'16)... WikiText-103 (Merity et al., 2017) and enwik8 (Hutter, 2006)... ImageNet-1k (Deng et al., 2009) dataset... SC10 subset of the Speech Commands dataset (Warden, 2018). |
| Dataset Splits | Yes | The ImageNet-1k (Deng et al., 2009) dataset consists of 1.28M training images and 50K validation images from 1000 classes; top-1 accuracy on the validation set is reported in Table 6. ... We use Newstest2013 as the validation set and Newstest2014 as the test set. ... At training time, we split the training data into segments; each segment contains m consecutive chunks, where the chunk size is the effective attention length. m is a random integer variable uniformly sampled from [c_l, c_h]. (A sketch of this segment sampling appears after the table.) |
| Hardware Specification | No | The paper mentions that 'the computation of MEGA and Transformer can not fit in GPU memory,' but does not specify any particular GPU models, CPUs, or other hardware components used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'fairseq package (Ott et al., 2019)' but does not specify a version number for it or any other software dependencies. |
| Experiment Setup | Yes | The hyper-parameters of MEGA models on these tasks are listed in Table 8. ... Other training hyperparameters including optimizer, learning rate scheduler and architecture are presented in Table 9. ... The hyper-parameters of Transformer and MEGA models are listed in Table 10. ... Hyper-parameters are listed in Table 11. |
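
Because the paper ships no pseudocode, here is a minimal illustrative sketch of the damped exponential moving average (EMA) recurrence that gives Mega its name, y_t = α ⊙ x_t + (1 − α ⊙ δ) ⊙ y_{t−1}, written in PyTorch (the paper builds on fairseq, which is PyTorch-based). The function name `damped_ema`, the tensor shapes, and the plain Python loop are assumptions made for clarity; the actual model uses a multi-dimensional EMA with learned expansion and projection and evaluates the recurrence as a convolution, and this sketch omits the gated attention that consumes the EMA output.

```python
import torch

def damped_ema(x, alpha, delta):
    """Illustrative damped EMA: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}.

    x:     (seq_len, d) input sequence
    alpha: (d,) smoothing factors in (0, 1)
    delta: (d,) damping factors in (0, 1)
    Returns a tensor with the same shape as x.
    """
    y_prev = torch.zeros_like(x[0])
    outputs = []
    for x_t in x:  # O(L) recurrence; the paper computes this via convolution in practice
        y_prev = alpha * x_t + (1.0 - alpha * delta) * y_prev
        outputs.append(y_prev)
    return torch.stack(outputs)

# Illustrative usage with arbitrary shapes (not the paper's dimensions)
x = torch.randn(128, 64)                # (seq_len, model_dim)
alpha = torch.sigmoid(torch.randn(64))  # learnable parameters in the actual model
delta = torch.sigmoid(torch.randn(64))
y = damped_ema(x, alpha, delta)
print(y.shape)                          # torch.Size([128, 64])
```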
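The language-modeling split described in the Dataset Splits row (segments of m consecutive chunks, with m drawn uniformly from [c_l, c_h]) can also be illustrated with a short sketch. The helper name `sample_segments` and the concrete values of `chunk_size`, `c_l`, and `c_h` below are hypothetical; the paper's actual settings are reported in its hyper-parameter tables.

```python
import random

def sample_segments(tokens, chunk_size, c_l, c_h, seed=0):
    """Split a flat token stream into segments of m consecutive chunks,
    with m drawn uniformly from [c_l, c_h] (illustrative re-implementation).

    tokens:     flat list of token ids
    chunk_size: effective attention length used by chunk-wise attention
    """
    rng = random.Random(seed)
    segments = []
    pos = 0
    while pos < len(tokens):
        m = rng.randint(c_l, c_h)        # number of chunks in this segment
        seg_len = m * chunk_size
        segments.append(tokens[pos:pos + seg_len])  # last segment may be shorter
        pos += seg_len
    return segments

# Illustrative values only
segs = sample_segments(list(range(10_000)), chunk_size=1024, c_l=2, c_h=6)
print(len(segs), [len(s) // 1024 for s in segs[:5]])
```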