Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access
Authors: Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we compare RAMba with baselines such as Transformers, Mamba-2, and their variants with sliding window attention [14] and NSA [70], evaluating performance across long-range language modeling, downstream tasks, and efficiency. RAMba consistently outperforms the baselines in long-context modeling and downstream tasks while exhibiting exceptional length generalization. Notably, it is the first Mamba-based model to achieve perfect accuracy on a 64M context in the passkey retrieval task. In terms of efficiency, HSA is 3× faster than NSA and 5–25× faster than full attention for contexts of 16K tokens or more during the forward pass. Additionally, when memory offloading is enabled, RAMba maintains nearly constant memory usage. These results demonstrate RAMba's superior capability in long-text modeling. In summary, our contributions are threefold: 1. We propose HSA, a novel hierarchical attention mechanism paired with a hardware-efficient algorithm that simultaneously enables efficiency, length generalization, and flexible long-range random access. 2. Based on HSA, we introduce RAMba, which integrates the advantages of the attention mechanism into Mamba while maintaining a nearly constant memory footprint during inference. 3. We conducted comprehensive experiments on the length generalization of Mamba with various attention mechanisms. The results show that HSA excels in both performance and efficiency. |
| Researcher Affiliation | Collaboration | Xiang Hu¹, Jiaqi Leng², Jun Zhao², Kewei Tu³, Wei Wu¹; ¹Ant Group, ²Fudan University, ³ShanghaiTech University |
| Pseudocode | Yes | Algorithm 1 FORWARD (thread t): 1: O′_t ← 0 // Initialize O′_t ∈ R^{h×d_h} 2: Q′ ← load Q_t // Load Q_t to Static RAM (SRAM); Q ∈ R^{l×h×d_h}, Q′ ∈ R^{h×d_h} 3: for 1 ≤ k ≤ K do 4: i ← load I_{t,k}, w′ ← load w_{t,k} // I ∈ Z^{l×K}, w ∈ R^{l×K} 5: K′ ← load K_i, V′ ← load V_i // K, V ∈ R^{l×h×d_h}; K′, V′ ∈ R^{S×d_h} 6: O′ ← softmax_1(Q′ K′^⊤) V′ // Inter-chunk token-level attention, no online softmax required. 7: O′_t ← O′_t + w′ O′ // Chunk-level attention via weighted sum. 8: end for 9: O′_t write to O_t // Write to High Bandwidth Memory (HBM) from Static RAM (SRAM). Algorithm 2 BACKWARD-Q, w (thread t): ∇Q′ ← 0, Q′ ← load Q_t, ∇O′ ← load ∇O_t; for 1 ≤ k ≤ K do; i ← load I_{t,k}, w′ ← load w_{t,k}; K′, V′ ← load K_i, V_i // K′, V′ ∈ R^{S×d_h}; P ← softmax_1(Q′ K′^⊤) // P ∈ R^{h×S}; O′ ← P V′ // O′ ∈ R^{h×d_h}; D′ ← rowsum(∇O′ ⊙ O′) // pointwise multiply, D′ ∈ R^h, O′ ∈ R^{h×d_h}; D′ write to D_{t,k} // D ∈ R^{l×K×h}; ∇w′ ← rowsum(D′) // ∇w′ ∈ R; ∇w′ write to ∇w_{t,k} // ∇w ∈ R^{l×K}; ∇P ← ∇O′ V′^⊤ // ∇P ∈ R^{h×S}; ∇S ← w′ P ⊙ (∇P − D′) // ∇S ∈ R^{h×S}; ∇Q′_t ← ∇Q′_t + ∇S K′ // ∇Q′_t ∈ R^{h×d_h}; end for; ∇Q′ write to ∇Q_t. Algorithm 3 BACKWARD-K,V (thread i): K′, V′ ← load K_i, V_i // K′, V′ ∈ R^{S×d_h}; ∇K′, ∇V′ ← 0 // ∇K′, ∇V′ ∈ R^{S×d_h}; for 1 ≤ t ≤ l do; if M_{t,i} is true then; k ← load R_{t,i}, w′ ← load w_{t,i}; Q′ ← load Q_t // Q′ ∈ R^{h×d_h}; ∇O′ ← load ∇O_t // ∇O′ ∈ R^{h×d_h}; D′ ← load D_{t,k} // D′ ∈ R^g; P ← softmax_1(Q′ K′^⊤) // P ∈ R^{h×S}; ∇V′ ← ∇V′ + w′ P^⊤ ∇O′; ∇P ← ∇O′ V′^⊤ // ∇P ∈ R^{h×S}; ∇S ← w′ P ⊙ (∇P − D′) // ∇S ∈ R^{h×S}; end if; ∇K′, ∇V′ write to ∇K_i, ∇V_i; end for |
| Open Source Code | Yes | https://github.com/ant-research/long-context-modeling |
| Open Datasets | Yes | All models are pre-trained on the same 60-billion-token subset of the Pile dataset [20]. Detailed training hyper-parameters are provided in Appendix C. *https://github.com/fla-org/native-sparse-attention Table 1: Perplexity for long-range language modeling. We highlight the best results in bold and underline the second best. All models are pre-trained on 4K contexts. Values for the 370M models are listed as PG19 / arXiv / Code at eval_len = 4K; 16K; 64K. Transformer (full attention): 18.61 / 4.23 / 3.28; 539.15 / 199.42 / 62.17; >10^4 / >10^4 / 2865.51. Mamba: 17.92 / 4.24 / 3.28; 17.38 / 3.91 / 3.09; 17.30 / 3.86 / 3.05. w/ SWA (ALiBi): 17.82 / 4.21 / 3.26; 20.48 / 5.01 / 3.53; 23.86 / 6.46 / 3.96. w/ SWA (RoPE): 17.82 / 4.21 / 3.26; 17.50 / 4.03 / 3.19; 17.80 / 4.25 / 3.35. w/ NSA (w/ m.r.): 17.87 / 4.20 / 3.25; 17.31 / 3.87 / 3.06; 17.31 / 3.87 / 3.05. w/ NSA (w/o m.r.): 17.74 / 4.18 / 3.24; 17.56 / 4.29 / 3.26; 17.62 / 4.35 / 3.28. RAMba (w/ m.r.): 17.82 / 4.15 / 3.23; 17.15 / 3.73 / 3.04; 17.01 / 3.65 / 3.07. RAMba (w/o m.r.): 17.63 / 4.13 / 3.21; 17.11 / 3.81 / 3.08; 17.11 / 3.87 / 3.21. RAMba (w/o m.r., w/o s.b.): 18.07 / 4.52 / 3.34; 17.61 / 5.01 / 3.17; 17.61 / 6.05 / 3.16. 4.2 Long-range Language Modeling. Datasets. We evaluate long-range language modeling on PG19 [48], ArXiv-math [4], and Code [65]. Tasks. We evaluate various models' long-context modeling abilities on classic tasks like passkey retrieval [35] and the LongBench V2 dataset [7]. To increase task difficulty, we replace numbers in passkey retrieval with random token sequences. Since passkey retrieval is relatively simple, we further fine-tuned the models using synthetic data following RULER [25]. We use a context length of 4K for fine-tuning with a total training step size equivalent to 5% of the pre-training stage. Evaluations were conducted across different lengths on four RULER tasks: Single NIAH (S-N), Multi-queries NIAH (MQ-N), Variable Tracking (VT), and Frequent Words Extraction (FWE). To align with passkey retrieval, keys in Single NIAH were also replaced with random token sequences. We adopt a Cloze format for LongBench evaluation, following Waleffe et al. [62], to address the instruction-following challenges of small models. Since LongBench V2 is a zero-shot benchmark and thus small models may exhibit randomness, we additionally evaluate on fine-tunable datasets, including summarization tasks like XSUM [39] and CNN [38], and QA tasks like SQuAD [49], HotpotQA [68], and QuALITY [40]. |
| Dataset Splits | Yes | To ensure a fair comparison, we pre-train all 370M models from scratch with 4K context length to observe their performance and extrapolation capabilities across various tasks. For 2.7B models, the training details are presented in Appendix E. ... For an 8K token context, the memory is reset every 4K tokens, which aligns with the context length of other baselines. However, the chunk selection scope for sparse attention spans 8K tokens, which might be unfair to other baselines. To ensure fair comparisons, we apply the same settings to both HSA and NSA. ... We used a context length of 4K for fine-tuning with a total training step size equivalent to 5% of the pre-training stage. ... Appendix E: E.1 Post-training ... Specifically, we utilized BPTT for post-tuning the base model. We trained the model on sequences of 32K tokens with a batch size of 16 for 3K steps, totaling 1.5B tokens. This stage takes 5 hours on 32 PPUs. ... Warmup. ... We train the model with a 32K context length, batch size of 16, for 16K steps, with a peak learning rate of 2 × 10^-5, totaling 8B tokens. This stage takes around 24 hours on 32 PPUs. Post-Training. ... The model is trained with a context length of 32K, a batch size of 16, for 32K steps, using a peak learning rate of 2 × 10^-5, totaling 16B tokens. This stage takes around 48 hours on 32 PPUs. E.2 RULER finetuning ... We conduct pre-training on 60B tokens, which amounts to one-tenth of the Mamba-2 2.7B model, followed by fine-tuning on 1B synthetic data, which takes around 200 hours on 32 PPUs. |
| Hardware Specification | Yes | When measuring training throughput, we enable FSDP [75] and gradient checkpointing [12], running models on 16 Physics Processing Units (PPUs), each with approximately half the computational power of an A100 GPU. ... This stage takes 5 hours on 32 PPUs. ... This stage takes around 24 hours on 32 PPUs. ... This stage takes around 48 hours on 32 PPUs. ... This approach aims to evaluate whether RAMba trained from scratch can stably converge and demonstrate long-range retrieval capabilities. We conduct pre-training on 60B tokens, which amounts to one-tenth of the Mamba-2 2.7B model, followed by fine-tuning on 1B synthetic data, which takes around 200 hours on 32 PPUs. |
| Software Dependencies | No | In HSA, each token corresponds to a distinct set of K chunks, which can lead to a substantial memory footprint in a naive implementation. Inspired by NSA, we address this issue by implementing hardware-aligned HSA kernels based on Triton [57]. |
| Experiment Setup | Yes | To ensure a fair comparison, we pre-train all 370M models from scratch with 4K context length to observe their performance and extrapolation capabilities across various tasks. For 2.7B models, the training details are presented in Appendix E. Baselines. We adopt the Mamba-2 architecture as the backbone of the RNN model and YaRN [44] as the Transformer baseline. The parameter size of all models trained from scratch is 370M, with detailed parameters provided in Appendix B. We experiment with Mamba variants with different attention mechanisms, including sliding window attention, native sparse attention (NSA), and HSA. For sliding window attention, the window size is set to 512, incorporating two position encoding schemes: ALiBi [45] and RoPE, the latter following the settings in Samba [51]. We set the chunk size of HSA to 64 following NSA. To ensure that the field of view for sparse attention matches the sliding window size (64 * 8 = 512), we set the number of selected chunks to 8. For NSA, we use its efficient open-source implementation *. To isolate the effects of the sparse attention components, we disable the sliding window attention in NSA. The HSA incorporates a single-layered Transformer-based bi-directional encoder for chunk memory encoding, accounting for 5.4% of the total parameters, whose impact on fairness is minimal. HSA layers are inserted into the upper decoder every G = 8 Mamba layers, with other attention mechanisms like SWA and NSA following the same pattern. These settings remain consistent across all subsequent 370M models. Since the compressed attention in NSA functions similarly to Combiner [50], we do not conduct a separate comparison against Combiner. Some other related works [36] are not included in the experiments due to the lack of open-source implementations. Pre-training. All models are pre-trained on the same 60-billion-token subset of the Pile dataset [20]. Detailed training hyper-parameters are provided in Appendix C. *https://github.com/fla-org/native-sparse-attention ... Appendix C Training hyper-parameters. All 370M models used the AdamW optimizer with: linear learning rate warmup with warmup ratio 0.02 and cosine decay to 4e-5; peak learning rate 2e-3; total tokens 60B; batch size 1M tokens; gradient clip value 1.0; no linear bias terms; weight decay 1e-3; AdamW hyperparameter β = (0.9, 0.95) (the GPT-3 value). All models are pre-trained on 16 PPUs, with each taking approximately 60 hours. |
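
The forward pass quoted in the Pseudocode row (Algorithm 1) combines two levels of attention: token-level attention inside each selected chunk, then a weighted sum across chunks. The sketch below illustrates that structure for one query token and one head in plain Python. It is not the authors' Triton kernel. The function names and list-based layout are hypothetical, and an ordinary softmax stands in for the paper's `softmax_1`, whose exact definition is not given in the quoted text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hsa_forward_token(q, chunk_keys, chunk_values, chunk_ids, chunk_weights):
    """Hypothetical single-token, single-head version of Algorithm 1.

    q             : query vector, length d
    chunk_keys    : chunk_keys[i] is an S x d list of key vectors for chunk i
    chunk_values  : chunk_values[i] is an S x d list of value vectors for chunk i
    chunk_ids     : indices of the K chunks selected for this token (I_{t,k})
    chunk_weights : chunk-level weights for those chunks (w_{t,k})
    """
    d = len(q)
    out = [0.0] * d  # accumulator O'_t
    for i, w in zip(chunk_ids, chunk_weights):
        keys, values = chunk_keys[i], chunk_values[i]
        # Token-level attention inside the chunk. Each chunk is normalized
        # on its own, so no running (online) softmax state is needed.
        scores = [sum(qj * kj for qj, kj in zip(q, key)) for key in keys]
        probs = softmax(scores)
        chunk_out = [sum(p * v[j] for p, v in zip(probs, values)) for j in range(d)]
        # Chunk-level attention: weighted sum of the per-chunk outputs.
        out = [o + w * c for o, c in zip(out, chunk_out)]
    return out
```

With a zero query the token-level attention is uniform, so each chunk contributes the mean of its values scaled by its chunk weight, which makes the two-level structure easy to check by hand.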
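
The Appendix C schedule quoted in the Experiment Setup row (linear warmup with ratio 0.02, peak learning rate 2e-3, cosine decay to 4e-5, 60B tokens at 1M tokens per batch) can be written out as a small function. This is a minimal sketch under stated assumptions: "1M tokens" is taken to mean 10^6, which gives 60,000 optimizer steps, and the warmup is assumed to reach the peak on its last step. Neither detail is confirmed by the quoted text, and the names below are illustrative.

```python
import math

PEAK_LR = 2e-3            # peak learning rate (Appendix C)
FINAL_LR = 4e-5           # cosine decay target (Appendix C)
WARMUP_RATIO = 0.02       # fraction of steps spent in linear warmup
TOTAL_TOKENS = 60_000_000_000
TOKENS_PER_BATCH = 1_000_000  # assumption: "1M tokens" means 10**6

TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_BATCH
WARMUP_STEPS = round(WARMUP_RATIO * TOTAL_STEPS)


def learning_rate(step):
    """Learning rate at a zero-based optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup; assumed to hit the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to FINAL_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine
```

Under these assumptions the warmup lasts 1,200 of the 60,000 steps, after which the rate falls smoothly from 2e-3 to 4e-5.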