MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Authors: Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10× for pre-filling on an A100, while maintaining accuracy.
Researcher Affiliation | Collaboration | Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu; Microsoft Corporation, University of Surrey; {hjiang,chengzhang,yuqyang}@microsoft.com, yucheng.li@surrey.ac.uk
Pseudocode | Yes | Algorithm 1: Kernel-Aware Sparse Pattern Search; Algorithm 2: Vertical-Slash Head; Algorithm 3: Block-Sparse Head (an illustrative sketch of the Vertical-Slash estimation appears after this table)
Open Source Code | Yes | Our code is available at https://aka.ms/MInference.
Open Datasets | Yes | We use the provided metrics and scripts from the following benchmarks for evaluation. More details about the datasets can be found in Appendix C.1. (i) InfiniteBench [ZCH+24]; (ii) RULER [HSK+24]; (iii) Needle In A Haystack [Kam23]; (iv) PG-19 [RPJ+20]
Dataset Splits | Yes | Additionally, we use only one sample as our validation set from KV retrieval synthetic data with 30k token inputs, which exhibits strong generalization and stability across different lengths and domains.
Hardware Specification | Yes | Results show that MInference speeds up the pre-filling stage by up to 10× for 1M contexts with LLaMA-3-8B on a single A100, reducing latency from 30 minutes to 3 minutes per prompt, while maintaining or improving accuracy.
Software Dependencies | No | The paper mentions software like PyTorch, FlashAttention [Dao24], Triton [TKC19], and PIT [ZJZ+23], but does not specify their exact version numbers.
Experiment Setup | Yes | We set the target FLOPs t to 1k global tokens and 4k local windows in the A-shape pattern. We set last_q = 64 and block_size = 64 in the Vertical-Slash and Block-Sparse patterns, respectively. The latency experiments are conducted on a single Nvidia A100 GPU in the bfloat16 format. (These hyperparameters are collected in the configuration sketch after this table.)
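
The released pseudocode is not reproduced on this page. As a rough illustration of the Vertical-Slash idea (Algorithm 2), the following is a minimal PyTorch sketch that scores vertical columns and diagonal slashes using only the last last_q queries, then returns their top-k indices for building a sparse attention index. The function and parameter names (estimate_vertical_slash, v_topk, s_topk) and the top-k budgets are illustrative assumptions, not the authors' released implementation.

import torch

def estimate_vertical_slash(q, k, last_q=64, v_topk=1000, s_topk=1000):
    """Sketch of Vertical-Slash pattern estimation for one attention head.

    q, k: (seq_len, head_dim) tensors for a single head.
    Returns indices of the top-scoring vertical columns and slash (diagonal) offsets.
    """
    seq_len, head_dim = q.shape
    device = q.device
    idx_q = torch.arange(seq_len - last_q, seq_len, device=device)  # absolute positions of tail queries
    idx_k = torch.arange(seq_len, device=device)

    # Attend only the last `last_q` queries against all keys to estimate the pattern.
    scores = q[-last_q:] @ k.T / head_dim ** 0.5                    # (last_q, seq_len)
    causal = idx_k <= idx_q[:, None]                                # causal mask for tail queries
    probs = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)

    # Vertical score: total attention each key column receives.
    vertical = probs.sum(dim=0)                                     # (seq_len,)

    # Slash score: total attention each diagonal offset (query_pos - key_pos) receives.
    offsets = idx_q[:, None] - idx_k                                # (last_q, seq_len)
    slash = torch.zeros(seq_len, device=device, dtype=probs.dtype)
    slash.scatter_add_(0, offsets.clamp(min=0).flatten(), probs.flatten())

    v_idx = vertical.topk(min(v_topk, seq_len)).indices             # chosen vertical columns
    s_idx = slash.topk(min(s_topk, seq_len)).indices                # chosen slash offsets
    return v_idx, s_idx

# Example: q = torch.randn(4096, 128); k = torch.randn(4096, 128)
# v_idx, s_idx = estimate_vertical_slash(q, k)

In the paper's pipeline, the selected column and diagonal indices would then drive a sparse attention kernel (e.g., Triton/PIT-style) during pre-filling; that kernel is not sketched here.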
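
As a reading aid only, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration sketch. The dictionary keys below are hypothetical and do not correspond to the exact interface of the released code at https://aka.ms/MInference.

# Hypothetical configuration collecting the quoted setup values; key names are illustrative.
experiment_config = {
    "a_shape": {"global_tokens": 1024, "local_window": 4096},  # "1k global tokens and 4k local windows"
    "vertical_slash": {"last_q": 64},                           # tail queries used for pattern estimation
    "block_sparse": {"block_size": 64},
    "dtype": "bfloat16",                                        # latency runs on a single Nvidia A100
}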