Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention

Authors: Tianyi Zhang, Jonah Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2× at 16k context length.
Researcher Affiliation | Collaboration | Tianyi Zhang, Dept. of Computer Science, Rice University, xMAD.ai, Houston, TX, EMAIL; Jonah Yi, Dept. of Computer Science, Rice University, xMAD.ai, Houston, TX, EMAIL; Bowen Yao, Dept. of Computer Science, Rice University, Houston, TX, EMAIL; Zhaozhuo Xu, Dept. of Computer Science, Stevens Institute of Technology, xMAD.ai, Hoboken, NJ, EMAIL; Anshumali Shrivastava, Dept. of Computer Science, Rice University, Ken Kennedy Institute, ThirdAI Corp., xMAD.ai, Houston, TX, EMAIL
Pseudocode | Yes | Algorithm 1: Attention Score Computation in LLM; Algorithm 2: NoMAD-Attention Score Computation; Algorithm 3: NoMAD Dot-Product Lookup Accumulation Loop
Open Source Code | No | Justification: The code is proprietary to xMAD.ai. However, anyone is able to reproduce our results using the algorithm and procedure we describe in the paper. We will provide a mechanism to reproduce our numbers in the future.
Open Datasets | Yes | We measure the model quality with perplexity on WikiText-2 [30] and C4 [14] at the context length of 2048, and zero-shot accuracy (using the default configurations of lm-evaluation-harness [19]) on SciQ [50], Arc Easy (Arc-E), Arc Challenge (Arc-C) [11], Hellaswag [54], WinoGrande [38], and PIQA [6]. The centroids for key compression of NoMAD-Attention are learned on a calibration set of 16 sequences from WikiText-2, each with 2048 tokens.
Dataset Splits | Yes | The centroids for key compression of NoMAD-Attention are learned on a calibration set of 16 sequences from WikiText-2, each with 2048 tokens.
Hardware Specification | Yes | Experiments for latency and throughput are performed on a Linux server equipped with an Intel Xeon E5-2695 V3 14-core CPU, which supports AVX2 SIMD instructions, and 512GB of DDR4 RAM. Experiments for accuracy and perplexity are performed on two NVIDIA A100-40GB GPUs.
Software Dependencies | No | Our implementation of NoMAD-Attention is built in C and C++, based on the open-source projects llama.cpp [20] and FAISS [16]. We also built a GPU implementation of NoMAD-Attention for quick prototyping and key-compression learning, which is based on PyTorch [33] and Hugging Face Transformers [51]. While these software components are mentioned, specific version numbers are not provided in the paper.
Experiment Setup | Yes | The centroids for key compression of NoMAD-Attention are learned on a calibration set of 16 sequences from WikiText-2, each with 2048 tokens. To test the model efficiency, we benchmark the latency and throughput of CodeLlama-7b [37] (with 16-bit weights and 4-bit q4_0 quantized weights), which has a longer context length of 16,384 than the LLaMA family of models. We compare the efficiency of NoMAD-Attention-based models (with d_sub ∈ {1, 2, 4}) against Attention-based models with a llama.cpp-based implementation.
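
The rows above mention centroids learned for key compression, a sub-vector size d_sub, and a dot-product lookup accumulation loop. Since the authors' code is proprietary, the following is only a minimal sketch of the general idea, assuming a plain product-quantization scheme: keys are split into sub-vectors of length d_sub, each sub-vector is replaced by the index of its nearest centroid, and a query-key score is approximated by summing precomputed table entries. The function names, the use of Lloyd's k-means, the 16 centroids per codebook, and the synthetic Gaussian data are all assumptions made for illustration and are not taken from the paper.

```python
import random

def split(vec, d_sub):
    """Split a vector into contiguous sub-vectors of length d_sub."""
    return [vec[i:i + d_sub] for i in range(0, len(vec), d_sub)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest(p, centroids):
    """Index of the centroid closest to p in squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def learn_codebooks(calib_keys, d_sub, k):
    """One codebook per sub-vector position, learned on calibration keys."""
    n_sub = len(calib_keys[0]) // d_sub
    return [kmeans([split(key, d_sub)[s] for key in calib_keys], k, seed=s)
            for s in range(n_sub)]

def encode(key, codebooks, d_sub):
    """Compress a key into one centroid index per sub-vector."""
    return [nearest(sub, cb) for sub, cb in zip(split(key, d_sub), codebooks)]

def build_lut(query, codebooks, d_sub):
    """lut[s][c] = dot(query sub-vector s, centroid c); built once per query."""
    return [[dot(q_sub, c) for c in cb]
            for q_sub, cb in zip(split(query, d_sub), codebooks)]

def approx_score(codes, lut):
    """Accumulation loop: table lookups and additions only."""
    return sum(lut[s][c] for s, c in enumerate(codes))

# Synthetic demo data (hypothetical sizes, not the paper's settings).
D, D_SUB, K = 8, 2, 16
rng = random.Random(0)
calib_keys = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(200)]
query = [rng.gauss(0, 1) for _ in range(D)]
key = [rng.gauss(0, 1) for _ in range(D)]

codebooks = learn_codebooks(calib_keys, D_SUB, K)
codes = encode(key, codebooks, D_SUB)
lut = build_lut(query, codebooks, D_SUB)
score = approx_score(codes, lut)
print("approximate:", score, "exact:", dot(query, key))
```

In this sketch the multiplications happen only while building the lookup table, once per query; scoring each stored key afterwards needs one lookup and one addition per sub-vector. A smaller d_sub gives more sub-vectors per key, which would normally mean a closer approximation at the cost of more lookups.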