Mega: Moving Average Equipped Gated Attention
Authors: Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, Luke Zettlemoyer
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that MEGA achieves significant improvements over other sequence models, including variants of Transformers and recent state space models. |
| Researcher Affiliation | Collaboration | Xuezhe Ma (ISI, USC); Chunting Zhou (Meta AI); Xiang Kong (LTI, CMU); Junxian He (SJTU); Liangke Gui (LTI, CMU); Graham Neubig (LTI, CMU); Jonathan May (ISI, USC); Luke Zettlemoyer (Meta AI) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. (An illustrative, hedged sketch of the damped-EMA recurrence at the core of the method is given after this table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code release for the methodology described. |
| Open Datasets | Yes | To evaluate MEGA, we conduct experiments on five benchmark sequence modeling tasks across various data types. All the marked numbers indicate results from the baseline models replicated by us. More detailed descriptions, results and analysis are provided in Appendix D. ... Long Range Arena (LRA) benchmark recently introduced by Tay et al. (2021)... WMT 2016 English-German news translation (WMT'16)... WikiText-103 (Merity et al., 2017) and enwik8 (Hutter, 2006)... ImageNet-1k (Deng et al., 2009) dataset... SC10 subset of the Speech Commands dataset (Warden, 2018). |
| Dataset Splits | Yes | The ImageNet-1k (Deng et al., 2009) dataset consists of 1.28M training images and 50K validation images from 1000 classes; top-1 accuracy on the validation set is reported in Table 6. ... We use Newstest2013 as the validation set and Newstest2014 as the test set. ... At training time, we split the training data into segments; each segment contains m consecutive chunks, where the chunk size is the effective attention length. m is a random integer variable uniformly sampled from [c_l, c_h]. (A sketch of this segment sampling appears after the table.) |
| Hardware Specification | No | The paper mentions that 'the computation of MEGA and Transformer can not fit in GPU memory,' but does not specify any particular GPU models, CPUs, or other hardware components used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'fairseq package (Ott et al., 2019)' but does not specify a version number for it or any other software dependencies. |
| Experiment Setup | Yes | The hyper-parameters of MEGA models on these tasks are listed in Table 8. ... Other training hyperparameters including optimizer, learning rate scheduler and architecture are presented in Table 9. ... The hyper-parameters of Transformer and MEGA models are listed in Table 10. ... Hyper-parameters are listed in Table 11. |
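
Because the paper ships no pseudocode, here is a minimal illustrative sketch of the damped exponential moving average (EMA) recurrence that gives Mega its name, y_t = α ⊙ x_t + (1 − α ⊙ δ) ⊙ y_{t−1}, written in PyTorch (the paper builds on fairseq, which is PyTorch-based). The function name `damped_ema`, the tensor shapes, and the plain Python loop are assumptions made for clarity; the actual model uses a multi-dimensional EMA with learned expansion and projection and evaluates the recurrence as a convolution, and this sketch omits the gated attention that consumes the EMA output.

```python
import torch

def damped_ema(x, alpha, delta):
    """Illustrative damped EMA: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}.

    x:     (seq_len, d) input sequence
    alpha: (d,) smoothing factors in (0, 1)
    delta: (d,) damping factors in (0, 1)
    Returns a tensor with the same shape as x.
    """
    y_prev = torch.zeros_like(x[0])
    outputs = []
    for x_t in x:  # O(L) recurrence; the paper computes this via convolution in practice
        y_prev = alpha * x_t + (1.0 - alpha * delta) * y_prev
        outputs.append(y_prev)
    return torch.stack(outputs)

# Illustrative usage with arbitrary shapes (not the paper's dimensions)
x = torch.randn(128, 64)                # (seq_len, model_dim)
alpha = torch.sigmoid(torch.randn(64))  # learnable parameters in the actual model
delta = torch.sigmoid(torch.randn(64))
y = damped_ema(x, alpha, delta)
print(y.shape)                          # torch.Size([128, 64])
```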
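The language-modeling split described in the Dataset Splits row (segments of m consecutive chunks, with m drawn uniformly from [c_l, c_h]) can also be illustrated with a short sketch. The helper name `sample_segments` and the concrete values of `chunk_size`, `c_l`, and `c_h` below are hypothetical; the paper's actual settings are reported in its hyper-parameter tables.

```python
import random

def sample_segments(tokens, chunk_size, c_l, c_h, seed=0):
    """Split a flat token stream into segments of m consecutive chunks,
    with m drawn uniformly from [c_l, c_h] (illustrative re-implementation).

    tokens:     flat list of token ids
    chunk_size: effective attention length used by chunk-wise attention
    """
    rng = random.Random(seed)
    segments = []
    pos = 0
    while pos < len(tokens):
        m = rng.randint(c_l, c_h)        # number of chunks in this segment
        seg_len = m * chunk_size
        segments.append(tokens[pos:pos + seg_len])  # last segment may be shorter
        pos += seg_len
    return segments

# Illustrative values only
segs = sample_segments(list(range(10_000)), chunk_size=1024, c_l=2, c_h=6)
print(len(segs), [len(s) // 1024 for s in segs[:5]])
```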