MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Authors: Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10× for pre-filling on an A100, while maintaining accuracy. |
| Researcher Affiliation | Collaboration | Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu — Microsoft Corporation, University of Surrey; {hjiang,chengzhang,yuqyang}@microsoft.com, yucheng.li@surrey.ac.uk |
| Pseudocode | Yes | Algorithm 1 Kernel-Aware Sparse Pattern Search; Algorithm 2 Vertical-Slash Head; Algorithm 3 Block-Sparse Head |
| Open Source Code | Yes | Our code is available at https://aka.ms/MInference. |
| Open Datasets | Yes | We use the provided metrics and scripts from the following benchmarks for evaluation. More details about the datasets can be found in Appendix C.1. (i) InfiniteBench [ZCH+24]; (ii) RULER [HSK+24]; (iii) Needle In A Haystack [Kam23]; (iv) PG-19 [RPJ+20] |
| Dataset Splits | Yes | Additionally, we use only one sample as our validation set from KV retrieval synthetic data with 30k token inputs, which exhibits strong generalization and stability across different lengths and domains. |
| Hardware Specification | Yes | Results show that MInference speeds up the pre-filling stage by up to 10× for 1M contexts with LLaMA-3-8B on a single A100, reducing latency from 30 minutes to 3 minutes per prompt, while maintaining or improving accuracy. |
| Software Dependencies | No | The paper mentions software like PyTorch, FlashAttention [Dao24], Triton [TKC19], and PIT [ZJZ+23], but does not specify their exact version numbers. |
| Experiment Setup | Yes | We set the target FLOPs t to correspond to 1k global tokens and 4k local windows in the A-shape pattern. We set last_q = 64 and block_size = 64 in the Vertical-Slash and Block-Sparse patterns, respectively. The latency experiments are conducted on a single NVIDIA A100 GPU in bfloat16. (An illustrative sketch of the vertical-slash estimation follows this table.) |
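
For orientation, below is a minimal PyTorch sketch of the estimation step behind a Vertical-Slash head (Algorithm 2 above): the last last_q = 64 queries attend densely to all keys, and the key columns (vertical lines) and diagonals (slash lines) carrying the most attention mass are kept for the sparse pass. The function name and the top_v / top_s budgets are illustrative assumptions, not the released API; the actual implementation runs in custom Triton and PIT kernels.

```python
import torch


def estimate_vertical_slash(q, k, last_q=64, top_v=1000, top_s=64):
    """q, k: [seq_len, head_dim] for a single head (assumed layout).

    Returns the key positions (vertical lines) and diagonal offsets (slash
    lines) with the most attention mass, estimated from the last `last_q`
    queries only.
    """
    seq_len, head_dim = q.shape
    last_q = min(last_q, seq_len)
    device = q.device
    q_pos = seq_len - last_q + torch.arange(last_q, device=device)  # tail query positions
    k_pos = torch.arange(seq_len, device=device)

    # Dense attention of only the last `last_q` queries against all keys.
    scores = (q[-last_q:].float() @ k.float().T) / head_dim**0.5    # [last_q, seq_len]
    scores = scores.masked_fill(k_pos[None, :] > q_pos[:, None], float("-inf")).softmax(-1)

    # Vertical lines: key columns with the largest summed attention mass.
    vertical_idx = scores.sum(0).topk(min(top_v, seq_len)).indices

    # Slash lines: diagonals (offset = query_pos - key_pos) with the largest mass;
    # masked future cells carry zero probability, so clamping their offset is harmless.
    offsets = (q_pos[:, None] - k_pos[None, :]).clamp(min=0)
    slash_scores = torch.zeros(seq_len, device=device)
    slash_scores.scatter_add_(0, offsets.reshape(-1), scores.reshape(-1))
    slash_idx = slash_scores.topk(min(top_s, seq_len)).indices

    # These indices would then drive a sparse attention kernel (Triton/PIT in the paper).
    return vertical_idx, slash_idx
```

The point of the sketch is the design choice noted in the paper's setup: a 64-query tail is enough to approximate each head's dynamic sparse index online, so the expensive full attention is only ever computed over the selected vertical and slash lines.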