Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention

Authors: Can Yaras, Alec Xu, Pierre Abillama, Changwoo Lee, Laura Balzano

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at https://github.com/cjyaras/monarch-attention. ... With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: 1.4× for shorter sequences (N = 256), 4.5× for medium-length sequences (N = 4K), and 8.2× for longer sequences (N = 16K).
Researcher Affiliation | Academia | Can Yaras (University of Michigan); Alec S. Xu (University of Michigan); Pierre Abillama (University of Michigan); Changwoo Lee (University of Michigan); Laura Balzano (University of Michigan)
Pseudocode | Yes | Python-like code for MonarchAttention is given in Figure 4.
Open Source Code | Yes | Our code is available at https://github.com/cjyaras/monarch-attention.
Open Datasets | Yes | The ImageNet-1K evaluation dataset is retrieved from the Hugging Face datasets library (Lhoest et al., 2021) as imagenet-1k using the validation split. ... The SQuAD1.1 evaluation dataset is retrieved from the Hugging Face datasets library as squad using the validation split. ... The BookSum-chapters training/evaluation dataset is retrieved from the Hugging Face datasets library as kmfoda/booksum using the train and validation splits respectively.
Dataset Splits | Yes | The ImageNet-1K evaluation dataset is retrieved from the Hugging Face datasets library (Lhoest et al., 2021) as imagenet-1k using the validation split. ... For evaluation, we truncate and pad to sequence length of 384. ... The BookSum-chapters training/evaluation dataset is retrieved from the Hugging Face datasets library as kmfoda/booksum using the train and validation splits respectively. ... For evaluation, we truncate the input sequence to the corresponding sequence length in Figure 3. ... we also evaluate on the train split.
Hardware Specification | Yes | Finally, we validate that the computational/IO complexity reduction achieved by MonarchAttention translates into actual speed-ups on the NVIDIA A40, a modern GPU.
Software Dependencies | No | The paper mentions “Triton kernels,” “PyTorch’s scaled_dot_product_attention,” and “Hugging Face transformers library” but does not specify their version numbers, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | To evaluate the performance at different FLOP counts, we vary the number of steps T ∈ {1, 2, 3} for MonarchAttention, and vary the rank for Performer and Nyströmformer; see Appendix D.2 for more details on the set-up. ... We fine-tune for 5 epochs with batch size of 32 and learning rate of 10^-4 using the Adam optimizer (Kingma and Ba, 2014) without weight decay, with the input and summary sequences truncated and padded to 8192 and 512 tokens respectively.
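
The fine-tuning settings quoted in the Experiment Setup entry, together with the version information that the Software Dependencies entry finds missing, could be captured in a small run manifest. The following is a minimal sketch and not the authors' code: the names FineTuneSetup, installed_versions, and build_manifest are hypothetical, and the PyPI distribution names in PACKAGES are an assumption about how the libraries named in the paper are installed.

```python
import json
from dataclasses import asdict, dataclass
from importlib import metadata


@dataclass(frozen=True)
class FineTuneSetup:
    """Fine-tuning values as quoted in the Experiment Setup entry (hypothetical container)."""
    epochs: int = 5
    batch_size: int = 32
    learning_rate: float = 1e-4
    optimizer: str = "Adam"
    weight_decay: float = 0.0        # "without weight decay"
    max_input_tokens: int = 8192     # input truncated and padded to 8192
    max_summary_tokens: int = 512    # summary truncated and padded to 512


# Assumption: PyPI distribution names for the libraries the paper mentions.
PACKAGES = ("torch", "triton", "transformers", "datasets")


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest():
    """Combine the quoted settings with the versions found in this environment."""
    return {"setup": asdict(FineTuneSetup()), "versions": installed_versions()}


if __name__ == "__main__":
    print(json.dumps(build_manifest(), indent=2))
```

The version numbers are read from whatever environment runs the script, so the manifest documents a particular rerun; it does not recover the versions the authors used.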