Gated Slot Attention for Efficient Linear-Time Sequence Modeling

Authors: Yu Zhang, Songlin Yang, Rui-Jie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R (finetuning pretrained Transformers to RNNs) settings.
Researcher Affiliation | Collaboration | (1) School of Computer Science and Technology, Soochow University, China; (2) Massachusetts Institute of Technology; (3) University of California, Santa Cruz; (4) Tencent AI Lab; (5) Luxi Tech; (6) University of Waterloo
Pseudocode | Yes | Algorithm 1: Hardware-Efficient Gated Slot Attention
Open Source Code | Yes | https://github.com/sustcsonglin/flash-linear-attention and https://huggingface.co/fla-hub. We provide a PyTorch implementation for the above algorithm with chunkwise parallelism in Listing 1. (A minimal recurrent-form sketch of the algorithm appears after this table.)
Open Datasets | Yes | We utilize a subset of 100B tokens picked from the SlimPajama dataset [79].
Dataset Splits | No | The paper describes training and testing procedures and mentions evaluation harnesses, but it does not explicitly state dataset splits for validation (e.g., percentages or counts).
Hardware Specification | Yes | Fig. 4a illustrates the training throughput for four models on a single H800 GPU. We ran all models on 32 Nvidia H800 GPUs.
Software Dependencies | Yes | The input tokens are processed using the Mistral tokenizer [39]. We utilize the open-sourced Triton-based library FLA [95] to run all compared models. To facilitate distributed training and accelerate the process, we utilized the DeepSpeed framework and fused all necessary modules, including RoPE, cross-entropy, and LayerNorm, following the practice of [102]. (A tokenizer-loading sketch appears after this table.)
Experiment Setup | Yes | We use AdamW [50] with a weight decay of 0.01 as the optimizer. During training, the learning rate is first warmed up to 3 × 10^-4 over the first 1B tokens and then decayed gradually to 3 × 10^-5 with a cosine schedule. The number of attention heads is set to 4 and 5 for the 1.3B and 2.7B models, respectively. The number of memory slots is uniformly set to 64 for all models. (An optimizer and learning-rate schedule sketch appears after this table.)
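
For readers who want a concrete picture of the update summarized in Algorithm 1, below is a minimal, unoptimized PyTorch sketch of gated slot attention in its recurrent (inference-style) form. The tensor names, shapes, and the single-head simplification are our own assumptions for illustration; the paper's hardware-efficient chunkwise-parallel kernels are the ones in the FLA repository linked above.

```python
import torch
import torch.nn.functional as F

def gsa_recurrent(q, k, v, alpha):
    """Recurrent-form gated slot attention (single head, no batch, for clarity).

    q, k: [T, d_k], v: [T, d_v], alpha: [T, m] with gate values in (0, 1),
    where m is the number of memory slots (64 in the paper's experiments).
    """
    d_k, d_v, m = k.shape[-1], v.shape[-1], alpha.shape[-1]
    K_slots = k.new_zeros(m, d_k)   # slot keys
    V_slots = v.new_zeros(m, d_v)   # slot values
    outputs = []
    for t in range(k.shape[0]):
        a = alpha[t].unsqueeze(-1)                # [m, 1] per-slot forget gate
        K_slots = a * K_slots + (1 - a) * k[t]    # gated slot-key update
        V_slots = a * V_slots + (1 - a) * v[t]    # gated slot-value update
        attn = F.softmax(K_slots @ q[t], dim=-1)  # softmax attention over the m slots
        outputs.append(V_slots.t() @ attn)        # read out a d_v-dimensional output
    return torch.stack(outputs)                   # [T, d_v]

# Example: 8 timesteps, head dimension 16, 64 slots; gates produced by a sigmoid.
q, k, v = (torch.randn(8, 16) for _ in range(3))
alpha = torch.sigmoid(torch.randn(8, 64))
out = gsa_recurrent(q, k, v, alpha)   # -> shape [8, 16]
```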
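
The software-dependency row states that inputs are processed with the Mistral tokenizer. A plausible way to obtain it is through the Hugging Face transformers library; the specific checkpoint name below (mistralai/Mistral-7B-v0.1) is an assumption, not something the paper specifies.

```python
from transformers import AutoTokenizer

# Assumption: the Mistral tokenizer is pulled from the public Mistral-7B-v0.1 checkpoint.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
ids = tokenizer("Gated slot attention", return_tensors="pt").input_ids
```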
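
The experiment-setup row fixes the optimizer and learning-rate schedule, so here is a small sketch of that configuration. The stand-in model, the token-counting variables, and the linear shape of the warmup are assumptions layered on top of the reported numbers (AdamW, weight decay 0.01, warmup to 3e-4 over 1B tokens, cosine decay to 3e-5).

```python
import math
import torch

model = torch.nn.Linear(8, 8)          # stand-in for the actual 1.3B/2.7B GSA model
peak_lr, final_lr = 3e-4, 3e-5
warmup_tokens = 1_000_000_000          # warm up over the first 1B tokens
total_tokens = 100_000_000_000         # 100B-token SlimPajama subset

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.01)

def lr_at(tokens_seen: int) -> float:
    """Linear warmup to peak_lr, then cosine decay down to final_lr."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# In the training loop, set the learning rate before each optimizer step:
# for group in optimizer.param_groups:
#     group["lr"] = lr_at(tokens_seen)
```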