Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling
Authors: Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, with a chunk size of 16, the RAT block achieves a 7× improvement in training speed for 100K sequence length and 9× in generation at the 4K position, while maintaining similar performance compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised finetuning (SFT). ... 4 Experiments: In this section, we present the efficiency and large-scale evaluations of RAT, along with comparisons to other models. |
| Researcher Affiliation | Collaboration | Xiuying Wei¹, Anunay Yadav¹, Razvan Pascanu², Caglar Gulcehre¹; ¹CLAIRE, EPFL; ²Google DeepMind |
| Pseudocode | Yes | A.1 Algorithm We provide the pseudocode for the training and prefilling modes of RAT in Listing 1, and the pseudocode for the generation mode in Listing 2. |
| Open Source Code | Yes | Code is available at https://github.com/CLAIRE-Labo/RAT. |
| Open Datasets | Yes | For the 1.3B model experiments, we adopt the FineWeb-Edu dataset [26], using its 100B-token randomly sampled version downloaded from the Hugging Face repository. ... For downstream evaluation, we consider a suite of classical commonsense reasoning benchmarks from the EleutherAI evaluation harness [28], including PIQA [48], ARC-C [49], and HellaSwag [50]. For the LongBench evaluation... For SFT-based tasks, we use NarrativeQA [29] (two modes), QMSum [30], and WikiSum [31]... We adopt the RULER benchmark [32]... |
| Dataset Splits | Yes | For SFT tasks, we train the models on the official training splits with an answer-only loss and evaluate them on the corresponding test sets. ... We generate 1000 synthetic training samples for each of the 8 tasks, resulting in a total of 8000 examples disjoint from the validation sets. |
| Hardware Specification | Yes | We benchmark the latency of a single token mixing block, including input and output projections, on a single H100 GPU (GH200 system, 120GB)... Each model is trained on 16 H100 GPUs... |
| Software Dependencies | No | For training, we implement intra-chunk recurrence in Eq. (3) using PyTorch's associative scan, enabling forward and backward passes with O(T) FLOPs. ... For inter-chunk attention, we use PyTorch's flex attention... For decoding, ... standard implementations like flash attention [25] can be used without modification... All models are compiled using torch.compile and evaluated in bfloat16 with torch.cuda.amp. |
| Experiment Setup | Yes | For brevity, we summarize the setup for the 1.3B model with a 4K context window, which is used in most of our experiments. Full implementation details are available in Appendix A. ... a learning rate of 8.0e-4 decayed to 1.0e-6 (cosine schedule) and a global batch size of 2M tokens... 1.3B-parameter model uses a model dimension of 2048, 24 transformer layers, and a head dimension of 128, equipped with RMSNorm [51] and without bias. The RoPE base is also set to 10,000. The model parameters are initialized using a Gaussian distribution with a standard deviation of 0.02. ... For pretraining, we use a cosine-annealed learning rate schedule starting at 8.0 × 10⁻⁴ and decaying to 1.0 × 10⁻⁶, with 5% warmup. The global batch size is set to 2M tokens, and the context window is set to 4096. ... For SFT tasks, we fix the learning rate and batch size to (1.0 × 10⁻⁵, 128) for large datasets, and (1.0 × 10⁻⁵, 32) for the smaller QMSum [30] task... The weight decay is set as 0.01, and all other hyperparameters follow the pretraining setup. |
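
The pretraining schedule quoted in the Experiment Setup row (peak 8.0e-4, final 1.0e-6, cosine decay, 5% warmup) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `learning_rate`, the constant names, and the `total_steps` argument are hypothetical, and the warmup is assumed to be linear because the quoted text does not state its shape.

```python
import math

PEAK_LR = 8.0e-4        # peak learning rate quoted in the table
FINAL_LR = 1.0e-6       # final learning rate quoted in the table
WARMUP_FRACTION = 0.05  # 5% warmup quoted in the table


def learning_rate(step: int, total_steps: int) -> float:
    """Return the learning rate at `step` (0-indexed).

    Assumption: linear warmup to PEAK_LR, then cosine decay to FINAL_LR.
    """
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine
```

For example, `learning_rate(step, total_steps=1000)` ramps up over the first 50 steps, peaks at 8.0e-4, and then decays toward 1.0e-6. The total number of training steps is not given in the quoted text, so it is left as a parameter.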