MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Authors: Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10× for pre-filling on an A100, while maintaining accuracy.
Researcher Affiliation | Collaboration | Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu; Microsoft Corporation, University of Surrey; {hjiang,chengzhang,yuqyang}@microsoft.com, yucheng.li@surrey.ac.uk
Pseudocode | Yes | Algorithm 1: Kernel-Aware Sparse Pattern Search; Algorithm 2: Vertical-Slash Head; Algorithm 3: Block-Sparse Head (an illustrative sketch of the Vertical-Slash estimation appears after this table)
Open Source Code | Yes | Our code is available at https://aka.ms/MInference.
Open Datasets | Yes | We use the provided metrics and scripts from the following benchmarks for evaluation. More details about the datasets can be found in Appendix C.1. (i) InfiniteBench [ZCH+24]; (ii) RULER [HSK+24]; (iii) Needle In A Haystack [Kam23]; (iv) PG-19 [RPJ+20]
Dataset Splits | Yes | Additionally, we use only one sample as our validation set from KV retrieval synthetic data with 30k token inputs, which exhibits strong generalization and stability across different lengths and domains.
Hardware Specification | Yes | Results show that MInference speeds up the pre-filling stage by up to 10× for 1M contexts with LLaMA-3-8B on a single A100, reducing latency from 30 minutes to 3 minutes per prompt, while maintaining or improving accuracy.
Software Dependencies | No | The paper mentions software like PyTorch, FlashAttention [Dao24], Triton [TKC19], and PIT [ZJZ+23], but does not specify their exact version numbers.
Experiment Setup | Yes | We set the target FLOPs t to 1k global tokens and 4k local windows in the A-shape pattern. We set last_q = 64 and block_size = 64 in the Vertical-Slash and Block-Sparse patterns, respectively. The latency experiments are conducted on a single Nvidia A100 GPU in the bfloat16 format. (These hyperparameters are collected in the configuration sketch after this table.)
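
The released pseudocode is not reproduced on this page. As a rough illustration of the Vertical-Slash idea (Algorithm 2), the following is a minimal PyTorch sketch that scores vertical columns and diagonal slashes using only the last last_q queries, then returns their top-k indices for building a sparse attention index. The function and parameter names (estimate_vertical_slash, v_topk, s_topk) and the top-k budgets are illustrative assumptions, not the authors' released implementation.

import torch

def estimate_vertical_slash(q, k, last_q=64, v_topk=1000, s_topk=1000):
    """Sketch of Vertical-Slash pattern estimation for one attention head.

    q, k: (seq_len, head_dim) tensors for a single head.
    Returns indices of the top-scoring vertical columns and slash (diagonal) offsets.
    """
    seq_len, head_dim = q.shape
    device = q.device
    idx_q = torch.arange(seq_len - last_q, seq_len, device=device)  # absolute positions of tail queries
    idx_k = torch.arange(seq_len, device=device)

    # Attend only the last `last_q` queries against all keys to estimate the pattern.
    scores = q[-last_q:] @ k.T / head_dim ** 0.5                    # (last_q, seq_len)
    causal = idx_k <= idx_q[:, None]                                # causal mask for tail queries
    probs = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)

    # Vertical score: total attention each key column receives.
    vertical = probs.sum(dim=0)                                     # (seq_len,)

    # Slash score: total attention each diagonal offset (query_pos - key_pos) receives.
    offsets = idx_q[:, None] - idx_k                                # (last_q, seq_len)
    slash = torch.zeros(seq_len, device=device, dtype=probs.dtype)
    slash.scatter_add_(0, offsets.clamp(min=0).flatten(), probs.flatten())

    v_idx = vertical.topk(min(v_topk, seq_len)).indices             # chosen vertical columns
    s_idx = slash.topk(min(s_topk, seq_len)).indices                # chosen slash offsets
    return v_idx, s_idx

# Example: q = torch.randn(4096, 128); k = torch.randn(4096, 128)
# v_idx, s_idx = estimate_vertical_slash(q, k)

In the paper's pipeline, the selected column and diagonal indices would then drive a sparse attention kernel (e.g., Triton/PIT-style) during pre-filling; that kernel is not sketched here.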
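
As a reading aid only, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration sketch. The dictionary keys below are hypothetical and do not correspond to the exact interface of the released code at https://aka.ms/MInference.

# Hypothetical configuration collecting the quoted setup values; key names are illustrative.
experiment_config = {
    "a_shape": {"global_tokens": 1024, "local_window": 4096},  # "1k global tokens and 4k local windows"
    "vertical_slash": {"last_q": 64},                           # tail queries used for pattern estimation
    "block_sparse": {"block_size": 64},
    "dtype": "bfloat16",                                        # latency runs on a single Nvidia A100
}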