Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Gated Slot Attention for Efficient Linear-Time Sequence Modeling
Authors: Yu Zhang, Songlin Yang, Rui-Jie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings. |
| Researcher Affiliation | Collaboration | (1) School of Computer Science and Technology, Soochow University, China; (2) Massachusetts Institute of Technology; (3) University of California, Santa Cruz; (4) Tencent AI Lab; (5) Luxi Tech; (6) University of Waterloo |
| Pseudocode | Yes | Algorithm 1 Hardware-Efficient Gated Slot Attention |
| Open Source Code | Yes | https://github.com/sustcsonglin/flash-linear-attention https://huggingface.co/fla-hub We provide a PyTorch implementation for the above algorithm with chunkwise parallelism in Listing 1. |
| Open Datasets | Yes | We utilize a subset of 100B tokens picked from the Slimpajama dataset [79]. |
| Dataset Splits | No | The paper describes training and testing procedures and mentions evaluation harnesses, but it does not explicitly state dataset splits for validation (e.g., percentages or counts). |
| Hardware Specification | Yes | Fig. 4a illustrates the training throughput for four models on a single H800 GPU. We ran all models on 32 Nvidia H800 GPUs. |
| Software Dependencies | Yes | The input tokens are processed using the Mistral tokenizer [39]. We utilize the open-sourced Triton-based library FLA [95] to run all compared models. To facilitate distributed training and accelerate the process, we utilized the DeepSpeed framework and fused all necessary modules, including ROPE, cross-entropy, and LayerNorm, following the practice of [102]. |
| Experiment Setup | Yes | We use AdamW [50] with a weight decay 0.01 as the optimizer. During training, the learning rate is first warmed up to $3 \times 10^{-4}$ in the first 1B tokens, and then decayed to $3 \times 10^{-5}$ gradually with a cosine schedule. The number of attention heads is set to 4 and 5 for 1.3B and 2.7B models, respectively. The number of memory slots is uniformly set to 64 for all models. |
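
The learning-rate schedule quoted in the Experiment Setup row can be sketched as a function of tokens seen. This is a minimal illustration, not the authors' training code. It assumes a linear warmup (the quote does not state the warmup shape) and a total budget of 100B tokens (taken from the Open Datasets row). The names `learning_rate`, `PEAK_LR`, `FINAL_LR`, `WARMUP_TOKENS`, and `TOTAL_TOKENS` are hypothetical.

```python
import math

PEAK_LR = 3e-4                  # peak learning rate quoted in the table
FINAL_LR = 3e-5                 # final learning rate quoted in the table
WARMUP_TOKENS = 1_000_000_000   # "first 1B tokens"
TOTAL_TOKENS = 100_000_000_000  # assumption: the 100B-token training subset


def learning_rate(tokens_seen: int) -> float:
    """Return the learning rate after `tokens_seen` training tokens."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumption: linear warmup from 0 to the peak rate.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    # Fraction of the post-warmup budget consumed, clamped to [0, 1].
    progress = min(1.0, (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS))
    # Cosine decay from the peak rate down to the final rate.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


if __name__ == "__main__":
    for tokens in (0, WARMUP_TOKENS, TOTAL_TOKENS // 2, TOTAL_TOKENS):
        print(f"{tokens:>15,d} tokens -> lr = {learning_rate(tokens):.2e}")
```

In practice the schedule would be stepped per optimizer update rather than per token; converting between the two requires the batch size and sequence length, which the quoted text does not give.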