Gated Slot Attention for Efficient Linear-Time Sequence Modeling
Authors: Yu Zhang, Songlin Yang, Rui-Jie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R (finetuning pretrained Transformers to RNNs) settings. |
| Researcher Affiliation | Collaboration | (1) School of Computer Science and Technology, Soochow University, China; (2) Massachusetts Institute of Technology; (3) University of California, Santa Cruz; (4) Tencent AI Lab; (5) Luxi Tech; (6) University of Waterloo |
| Pseudocode | Yes | Algorithm 1 Hardware-Efficient Gated Slot Attention |
| Open Source Code | Yes | https://github.com/sustcsonglin/flash-linear-attention and https://huggingface.co/fla-hub. We provide a PyTorch implementation for the above algorithm with chunkwise parallelism in Listing 1. (A hedged recurrent sketch of the gated-slot update appears after the table.) |
| Open Datasets | Yes | We utilize a subset of 100B tokens picked from the SlimPajama dataset [79]. |
| Dataset Splits | No | The paper describes training and testing procedures and mentions evaluation harnesses, but it does not explicitly state dataset splits for validation (e.g., percentages or counts). |
| Hardware Specification | Yes | Fig. 4a illustrates the training throughput for four models on a single H800 GPU. We ran all models on 32 Nvidia H800 GPUs. |
| Software Dependencies | Yes | The input tokens are processed using the Mistral tokenizer [39]. We utilize the open-sourced Triton-based library FLA [95] to run all compared models. To facilitate distributed training and accelerate the process, we utilized the DeepSpeed framework and fused all necessary modules, including RoPE, cross-entropy, and LayerNorm, following the practice of [102]. |
| Experiment Setup | Yes | We use AdamW [50] with a weight decay of 0.01 as the optimizer. During training, the learning rate is first warmed up to 3 × 10⁻⁴ over the first 1B tokens, and then decayed gradually to 3 × 10⁻⁵ with a cosine schedule. The number of attention heads is set to 4 and 5 for the 1.3B and 2.7B models, respectively. The number of memory slots is uniformly set to 64 for all models. (A hedged sketch of this optimizer and schedule follows the table.) |
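
For orientation, below is a minimal, non-chunkwise PyTorch sketch of the gated-slot recurrence referenced in the pseudocode and open-source-code rows. It assumes the ABC-style two-memory formulation with a per-slot forget gate; the function name `gsa_recurrent_step`, the single-head layout, and the toy shapes are illustrative assumptions, not the repository's API. The hardware-efficient chunkwise kernels are in the linked flash-linear-attention repository.

```python
import torch
import torch.nn.functional as F

def gsa_recurrent_step(q, k, v, alpha, K_mem, V_mem):
    """One recurrent step of a gated-slot update (illustrative, single head).

    q, k: (d_k,) query/key for the current token; v: (d_v,) value
    alpha: (m,) per-slot forget gate in (0, 1)
    K_mem: (m, d_k) slot-key memory; V_mem: (m, d_v) slot-value memory
    """
    # Decay the old slot contents and write the current token, gated per slot
    # (outer product of the write weight (1 - alpha) with the key/value).
    K_mem = alpha.unsqueeze(-1) * K_mem + (1 - alpha).unsqueeze(-1) * k
    V_mem = alpha.unsqueeze(-1) * V_mem + (1 - alpha).unsqueeze(-1) * v
    # Read: softmax attention over the m slots, then mix the slot values.
    slot_scores = F.softmax(K_mem @ q, dim=-1)   # (m,)
    out = V_mem.transpose(0, 1) @ slot_scores    # (d_v,)
    return out, K_mem, V_mem

# Toy usage: roll the recurrence over a short sequence.
m, d_k, d_v, T = 64, 128, 128, 16
K_mem, V_mem = torch.zeros(m, d_k), torch.zeros(m, d_v)
for _ in range(T):
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    alpha = torch.sigmoid(torch.randn(m))        # per-slot forget gate
    o, K_mem, V_mem = gsa_recurrent_step(q, k, v, alpha, K_mem, V_mem)
```

The slot count `m = 64` matches the memory-slot setting reported in the experiment-setup row; everything else here is a toy configuration for illustration only.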
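The experiment-setup row also fully specifies the optimizer and learning-rate schedule, so a minimal sketch is given below: AdamW with weight decay 0.01, linear warmup to a peak of 3 × 10⁻⁴, then cosine decay to 3 × 10⁻⁵. The step counts (a conversion from the paper's token budgets of 1B warmup tokens out of 100B total) and the placeholder model are assumptions for illustration.

```python
import math
import torch

# Assumed step counts: the paper gives token budgets (1B warmup of 100B total);
# converting tokens to optimizer steps depends on the global batch size.
warmup_steps, total_steps = 1_000, 100_000
peak_lr, final_lr = 3e-4, 3e-5

model = torch.nn.Linear(1024, 1024)  # placeholder for the actual language model
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.01)

def lr_lambda(step: int) -> float:
    # Linear warmup from 0 to the peak learning rate ...
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # ... then cosine decay from the peak down to the final learning rate.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return (final_lr + (peak_lr - final_lr) * cosine) / peak_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

In a training loop, each step would call `optimizer.step()` followed by `scheduler.step()` so that the learning rate follows the warmup-then-cosine curve described above.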