Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Gated Slot Attention for Efficient Linear-Time Sequence Modeling

Authors: Yu Zhang, Songlin Yang, Rui-Jie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
Researcher Affiliation | Collaboration | ¹School of Computer Science and Technology, Soochow University, China; ²Massachusetts Institute of Technology; ³University of California, Santa Cruz; ⁴Tencent AI Lab; ⁵Luxi Tech; ⁶University of Waterloo
Pseudocode | Yes | Algorithm 1: Hardware-Efficient Gated Slot Attention
Open Source Code | Yes | https://github.com/sustcsonglin/flash-linear-attention and https://huggingface.co/fla-hub. We provide a PyTorch implementation for the above algorithm with chunkwise parallelism in Listing 1.
Open Datasets | Yes | We utilize a subset of 100B tokens picked from the SlimPajama dataset [79].
Dataset Splits | No | The paper describes training and testing procedures and mentions evaluation harnesses, but it does not explicitly state dataset splits for validation (e.g., percentages or counts).
Hardware Specification | Yes | Fig. 4a illustrates the training throughput for four models on a single H800 GPU. We ran all models on 32 Nvidia H800 GPUs.
Software Dependencies | Yes | The input tokens are processed using the Mistral tokenizer [39]. We utilize the open-sourced Triton-based library FLA [95] to run all compared models. To facilitate distributed training and accelerate the process, we utilized the DeepSpeed framework and fused all necessary modules, including RoPE, cross-entropy, and LayerNorm, following the practice of [102].
Experiment Setup | Yes | We use AdamW [50] with a weight decay of 0.01 as the optimizer. During training, the learning rate is first warmed up to 3×10⁻⁴ in the first 1B tokens, and then decayed to 3×10⁻⁵ gradually with a cosine schedule. The number of attention heads is set to 4 and 5 for the 1.3B and 2.7B models, respectively. The number of memory slots is uniformly set to 64 for all models.
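The learning-rate schedule quoted above (linear warmup to 3×10⁻⁴ over the first 1B tokens, then cosine decay to 3×10⁻⁵) can be sketched as a token-indexed function. This is a minimal illustration, not the authors' code: the function name `lr_at` is hypothetical, and the 100B-token horizon is assumed from the dataset row.

```python
import math

def lr_at(tokens_seen, warmup_tokens=1e9, total_tokens=100e9,
          peak_lr=3e-4, min_lr=3e-5):
    """Hypothetical sketch: linear warmup to peak_lr over the first
    1B tokens, then cosine decay toward min_lr at total_tokens."""
    if tokens_seen < warmup_tokens:
        # Linear warmup phase: 0 -> peak_lr.
        return peak_lr * tokens_seen / warmup_tokens
    # Cosine decay phase: peak_lr -> min_lr as progress goes 0 -> 1.
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, `lr_at(1e9)` returns the peak rate 3×10⁻⁴, and `lr_at(100e9)` returns the floor 3×10⁻⁵.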