Gated Slot Attention for Efficient Linear-Time Sequence Modeling
Authors: Yu Zhang, Songlin Yang, Rui-Jie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R (finetuning pretrained Transformers to RNNs) settings. |
| Researcher Affiliation | Collaboration | (1) School of Computer Science and Technology, Soochow University, China; (2) Massachusetts Institute of Technology; (3) University of California, Santa Cruz; (4) Tencent AI Lab; (5) Luxi Tech; (6) University of Waterloo |
| Pseudocode | Yes | Algorithm 1 Hardware-Efficient Gated Slot Attention |
| Open Source Code | Yes | https://github.com/sustcsonglin/flash-linear-attention and https://huggingface.co/fla-hub. We provide a PyTorch implementation for the above algorithm with chunkwise parallelism in Listing 1. (A hedged recurrent sketch of the gated-slot update appears after the table.) |
| Open Datasets | Yes | We utilize a subset of 100B tokens picked from the SlimPajama dataset [79]. |
| Dataset Splits | No | The paper describes training and testing procedures and mentions evaluation harnesses, but it does not explicitly state dataset splits for validation (e.g., percentages or counts). |
| Hardware Specification | Yes | Fig. 4a illustrates the training throughput for four models on a single H800 GPU. We ran all models on 32 Nvidia H800 GPUs. |
| Software Dependencies | Yes | The input tokens are processed using the Mistral tokenizer [39]. We utilize the open-sourced Triton-based library FLA [95] to run all compared models. To facilitate distributed training and accelerate the process, we utilized the DeepSpeed framework and fused all necessary modules, including RoPE, cross-entropy, and LayerNorm, following the practice of [102]. |
| Experiment Setup | Yes | We use AdamW [50] with a weight decay of 0.01 as the optimizer. During training, the learning rate is first warmed up to 3 × 10⁻⁴ over the first 1B tokens, and then decayed gradually to 3 × 10⁻⁵ with a cosine schedule. The number of attention heads is set to 4 and 5 for the 1.3B and 2.7B models, respectively. The number of memory slots is uniformly set to 64 for all models. (A hedged sketch of this optimizer and schedule follows the table.) |
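
For orientation, below is a minimal, non-chunkwise PyTorch sketch of the gated-slot recurrence referenced in the pseudocode and open-source-code rows. It assumes the ABC-style two-memory formulation with a per-slot forget gate; the function name `gsa_recurrent_step`, the single-head layout, and the toy shapes are illustrative assumptions, not the repository's API. The hardware-efficient chunkwise kernels are in the linked flash-linear-attention repository.

```python
import torch
import torch.nn.functional as F

def gsa_recurrent_step(q, k, v, alpha, K_mem, V_mem):
    """One recurrent step of a gated-slot update (illustrative, single head).

    q, k: (d_k,) query/key for the current token; v: (d_v,) value
    alpha: (m,) per-slot forget gate in (0, 1)
    K_mem: (m, d_k) slot-key memory; V_mem: (m, d_v) slot-value memory
    """
    # Decay the old slot contents and write the current token, gated per slot
    # (outer product of the write weight (1 - alpha) with the key/value).
    K_mem = alpha.unsqueeze(-1) * K_mem + (1 - alpha).unsqueeze(-1) * k
    V_mem = alpha.unsqueeze(-1) * V_mem + (1 - alpha).unsqueeze(-1) * v
    # Read: softmax attention over the m slots, then mix the slot values.
    slot_scores = F.softmax(K_mem @ q, dim=-1)   # (m,)
    out = V_mem.transpose(0, 1) @ slot_scores    # (d_v,)
    return out, K_mem, V_mem

# Toy usage: roll the recurrence over a short sequence.
m, d_k, d_v, T = 64, 128, 128, 16
K_mem, V_mem = torch.zeros(m, d_k), torch.zeros(m, d_v)
for _ in range(T):
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    alpha = torch.sigmoid(torch.randn(m))        # per-slot forget gate
    o, K_mem, V_mem = gsa_recurrent_step(q, k, v, alpha, K_mem, V_mem)
```

The slot count `m = 64` matches the memory-slot setting reported in the experiment-setup row; everything else here is a toy configuration for illustration only.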
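The experiment-setup row also fully specifies the optimizer and learning-rate schedule, so a minimal sketch is given below: AdamW with weight decay 0.01, linear warmup to a peak of 3 × 10⁻⁴, then cosine decay to 3 × 10⁻⁵. The step counts (a conversion from the paper's token budgets of 1B warmup tokens out of 100B total) and the placeholder model are assumptions for illustration.

```python
import math
import torch

# Assumed step counts: the paper gives token budgets (1B warmup of 100B total);
# converting tokens to optimizer steps depends on the global batch size.
warmup_steps, total_steps = 1_000, 100_000
peak_lr, final_lr = 3e-4, 3e-5

model = torch.nn.Linear(1024, 1024)  # placeholder for the actual language model
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.01)

def lr_lambda(step: int) -> float:
    # Linear warmup from 0 to the peak learning rate ...
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # ... then cosine decay from the peak down to the final learning rate.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return (final_lr + (peak_lr - final_lr) * cosine) / peak_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

In a training loop, each step would call `optimizer.step()` followed by `scheduler.step()` so that the learning rate follows the warmup-then-cosine curve described above.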