MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map

Authors: Yuhong Chou, Man Yao, Kexin Wang, Yuqi Pan, Rui-Jie Zhu, Jibin Wu, Yiran Zhong, Yu Qiao, Bo Xu, Guoqi Li

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on Multi-Query Associative Recall (MQAR) task, language modeling, image classification, and Long-Range Arena (LRA) benchmark demonstrate that MetaLA is more effective than the existing linear models.
Researcher Affiliation Collaboration Yuhong Chou (1,2), Man Yao (2), Kexin Wang (2), Yuqi Pan (2), Ruijie Zhu (3), Yiran Zhong (4), Yu Qiao (4), Jibin Wu (1), Bo Xu (2), Guoqi Li (2); affiliations: 1 The Hong Kong Polytechnic University, 2 Institute of Automation, Chinese Academy of Sciences, 3 UC Santa Cruz, 4 Shanghai AI Lab
Pseudocode No The paper does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code Yes Code: https://github.com/BICLab/MetaLA
Open Datasets Yes Our experiments on Multi-Query Associative Recall (MQAR) task, language modeling, image classification, and Long-Range Arena (LRA) benchmark demonstrate that MetaLA is more effective than the existing linear models. Code: https://github.com/BICLab/MetaLA
Dataset Splits Yes For the 360M/1.4B model, we train it from scratch with a total of 15B/300B tokens on 16/32 A100 GPUs at a learning rate of 3e-4/2e-4 with batch size 0.5M/2M. Both models maintain a length of 2048 and are trained using fp16.
Hardware Specification Yes For the 360M/1.4B model, we train it from scratch with a total of 15B/300B tokens on 16/32 A100 GPUs at a learning rate of 3e-4/2e-4 with batch size 0.5M/2M. Both models maintain a length of 2048 and are trained using fp16. (A step-count sketch based on these numbers follows the table.)
Software Dependencies No The paper mentions software components like "GPT-Neox [45]" and "Flash Attention [90]" and states that "We implement all the pre-train experiments with GPT-Neox [45]". However, it does not provide specific version numbers for these or other software dependencies like Python or PyTorch.
Experiment Setup Yes The hyperparameters for all tasks can be found in Table A4. For the 360M/1.4B model, we train it from scratch with a total of 15B/300B tokens on 16/32 A100 GPUs at a learning rate of 3e-4/2e-4 with batch size 0.5M/2M. Both models maintain a length of 2048 and are trained using fp16. The training setup of the baselines [15, 16, 44] for the 360M MetaLA is aligned with the MetaLA configuration. For the 1.3B MetaLA, we compare it with publicly available models [14, 15, 16, 20, 44]. To maintain a fair comparison between linear models, we trained Mamba from scratch using the same settings as MetaLA on 100B tokens. For GLA and RetNet, we adopted the open-source checkpoints in FLA [79]. For HGRN and Pythia, we used the official open-source checkpoints. We implement all the pre-train experiments with GPT-NeoX [45]. We evaluate our models on the SuperGLUE benchmark [38] and a Common-Sense Reasoning benchmark including LAMBADA [80], LogiQA [81], Winograd Schema Challenge (WSC273) [82], BoolQ [83], PiQA [84], HellaSwag [85], WinoGrande [86], ARC-easy (ARC-e), ARC-challenge (ARC-c) [87], and OpenBookQA [88]. We report perplexity (ppl) and accuracy (acc) on LAMBADA, accuracy normalized by length on HellaSwag, ARC-challenge, and OpenBookQA, and acc on the other subtasks. For the SuperGLUE benchmark, we report the F1 score on CB, the Exact-Match (EM) score on MultiRC, and accuracy on the other subtasks, following the original work. The LM evaluation harness [89] is used to implement all evaluations.
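The training budgets quoted in the Dataset Splits and Hardware rows pin down the optimizer schedule up to one reading of the batch size. Below is a minimal arithmetic sketch, assuming "batch size 0.5M/2M" means tokens consumed per optimizer step; the model names and the per-step sequence counts are derived quantities, not figures reported by the paper.

```python
# Back-of-the-envelope training schedule from the quoted setup.
# Assumption: "batch size 0.5M/2M" is the number of tokens per optimizer step.

def training_steps(total_tokens: float, tokens_per_step: float) -> int:
    """Optimizer steps needed to consume the stated token budget."""
    return round(total_tokens / tokens_per_step)

configs = {
    # model: (total tokens, tokens per step, sequence length, learning rate)
    "MetaLA-360M": (15e9, 0.5e6, 2048, 3e-4),
    "MetaLA-1.4B": (300e9, 2e6, 2048, 2e-4),
}

for name, (tokens, batch_tokens, seq_len, lr) in configs.items():
    steps = training_steps(tokens, batch_tokens)
    seqs_per_step = round(batch_tokens / seq_len)
    print(f"{name}: ~{steps:,} steps, ~{seqs_per_step} sequences of "
          f"length {seq_len} per step, lr={lr}")
```

Under that reading, the 360M run takes roughly 30,000 optimizer steps and the 1.4B run roughly 150,000.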
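The Experiment Setup row states that every downstream score was produced with the LM evaluation harness [89]. The sketch below shows what such an evaluation could look like through the harness's Python API; the `hf` backend, the checkpoint path, the batch size, and the exact task identifiers are illustrative assumptions rather than the authors' actual command.

```python
# Hypothetical zero-shot evaluation of a pretrained checkpoint with
# lm-evaluation-harness; task names and the checkpoint path are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face model backend
    model_args="pretrained=path/to/metala-360m",  # placeholder checkpoint path
    tasks=[
        "lambada_openai", "logiqa", "wsc273", "boolq", "piqa",
        "hellaswag", "winogrande", "arc_easy", "arc_challenge", "openbookqa",
    ],
    batch_size=8,
)

# Print accuracy-style metrics (and perplexity, which the paper reports on LAMBADA).
for task, metrics in results["results"].items():
    shown = {k: v for k, v in metrics.items() if "acc" in k or "perplexity" in k}
    print(task, shown)
```

Length-normalized accuracy, which the paper reports for HellaSwag, ARC-challenge, and OpenBookQA, appears in the harness output as `acc_norm`.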