NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
Authors: Tianyi Zhang, Jonah Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2× at 16k context length. |
| Researcher Affiliation | Collaboration | Tianyi Zhang (Dept. of Computer Science, Rice University; xMAD.ai; Houston, TX; tz21@rice.edu); Jonah Yi (Dept. of Computer Science, Rice University; xMAD.ai; Houston, TX; jwy4@rice.edu); Bowen Yao (Dept. of Computer Science, Rice University; Houston, TX; by18@rice.edu); Zhaozhuo Xu (Dept. of Computer Science, Stevens Institute of Technology; xMAD.ai; Hoboken, NJ; zxu79@stevens.edu); Anshumali Shrivastava (Dept. of Computer Science, Rice University; Ken Kennedy Institute; ThirdAI Corp.; xMAD.ai; Houston, TX; anshumali@rice.edu) |
| Pseudocode | Yes | Algorithm 1 Attention Score Computation in LLM; Algorithm 2 NoMAD-Attention Score Computation; Algorithm 3 NoMAD Dot-Product Lookup Accumulation Loop (see the score-computation sketch after this table) |
| Open Source Code | No | Justification: The code is proprietary to xMAD.ai. However, anyone is able to reproduce our results using the algorithm and procedure we describe in the paper. We will provide a mechanism to reproduce our numbers in the future. |
| Open Datasets | Yes | We measure the model quality with perplexity on WikiText-2 [30] and C4 [14] at the context length of 2048, and zero-shot accuracy (using the default configurations of lm-evaluation-harness [19]) on SciQ [50], ARC Easy (ARC-E), ARC Challenge (ARC-C) [11], HellaSwag [54], WinoGrande [38], and PIQA [6]. The centroids for key compression of NoMAD-Attention are learned on a calibration set of 16 sequences from WikiText-2, each with 2048 tokens. |
| Dataset Splits | Yes | The centroids for key compression of NoMAD-Attention are learned on a calibration set of 16 sequences from WikiText-2, each with 2048 tokens (see the centroid-learning sketch after this table). |
| Hardware Specification | Yes | Experiments for latency and throughput are performed on a Linux server equipped with an Intel Xeon E5-2695 V3 14-core CPU, which supports AVX2 SIMD instructions, and 512GB of DDR4 RAM. Experiments for accuracy and perplexity are performed on two NVIDIA A100-40GB GPUs. |
| Software Dependencies | No | Our implementation of NoMAD-Attention is built in C and C++, based on the open-source projects llama.cpp [20] and FAISS [16]. We also built a GPU implementation of NoMAD-Attention for quick prototyping and key-compression learning, which is based on PyTorch [33] and Hugging Face Transformers [51]. While these software components are mentioned, specific version numbers are not provided in the paper. |
| Experiment Setup | Yes | The centroids for key compression of NoMAD-Attention are learned on a calibration set of 16 sequences from WikiText-2, each with 2048 tokens. To test the model efficiency, we benchmark the latency and throughput of CodeLlama-7b [37] (with 16-bit weights and 4-bit q4_0 quantized weights), which has a longer context length of 16,384 than the LLaMA family of models. We compare the efficiency of NoMAD-Attention-based models (with d_sub ∈ {1, 2, 4}) against Attention-based models with a llama.cpp-based implementation. |
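
The algorithms listed in the Pseudocode row replace query-key multiply-adds with lookups into query-to-centroid dot-product tables over product-quantized keys. Below is a minimal NumPy sketch of that score-computation idea; the function name `lookup_attention_scores` and the array layout are illustrative assumptions rather than the paper's code, and the paper's actual implementation performs these lookups with SIMD in-register shuffles in C/C++.

```python
import numpy as np

def lookup_attention_scores(query, key_codes, centroids, d_sub):
    """Approximate q.k for every cached key via per-sub-quantizer lookup tables.

    query     : (d,) float array, the current query vector
    key_codes : (n_keys, n_sub) integer array, PQ codes of the cached keys
    centroids : (n_sub, n_centroids, d_sub) float array, learned codebooks
    """
    d = query.shape[0]
    n_sub = d // d_sub
    # Build one lookup table per sub-quantizer: dot products between the
    # query sub-vector and every centroid (computed once per decoding step).
    tables = np.einsum('sd,scd->sc',
                       query.reshape(n_sub, d_sub), centroids)  # (n_sub, n_centroids)
    # Accumulate scores over cached keys using table lookups and additions only.
    scores = np.zeros(key_codes.shape[0])
    for s in range(n_sub):
        scores += tables[s, key_codes[:, s]]
    return scores
```

The accumulation over `key_codes` mirrors the role of Algorithm 3's lookup-accumulation loop: once the per-sub-quantizer tables are built for the current query, scoring every cached key requires no multiply-adds over the keys themselves.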
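
The centroids referenced in the Dataset Splits row are learned offline from key vectors collected on the calibration set. Here is a hedged sketch of that step, assuming per-sub-space k-means with scikit-learn; the function `learn_key_centroids` and the 16-centroid default are illustrative assumptions (the paper's prototype builds on FAISS).

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_key_centroids(calib_keys, d_sub, n_centroids=16, seed=0):
    """Learn per-sub-space codebooks from calibration key vectors.

    calib_keys  : (n_keys, d) float array of key vectors collected by running
                  the model over the calibration set (e.g. 16 WikiText-2
                  sequences of 2048 tokens, as in the paper)
    d_sub       : dimensionality of each sub-vector (the paper uses 1, 2, or 4)
    n_centroids : codebook size per sub-quantizer (an assumption here)
    """
    n_keys, d = calib_keys.shape
    n_sub = d // d_sub
    centroids = np.empty((n_sub, n_centroids, d_sub))
    for s in range(n_sub):
        # Cluster the key sub-vectors of sub-space s into a small codebook.
        sub = calib_keys[:, s * d_sub:(s + 1) * d_sub]
        km = KMeans(n_clusters=n_centroids, n_init=10, random_state=seed).fit(sub)
        centroids[s] = km.cluster_centers_
    return centroids
```

Keys cached at inference time would then be encoded by nearest-centroid assignment in each sub-space, producing the `key_codes` consumed by the score-computation sketch above.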