Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Authors: Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10× for pre-filling on an A100, while maintaining accuracy. |
| Researcher Affiliation | Collaboration | Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu. Microsoft Corporation, University of Surrey. |
| Pseudocode | Yes | Algorithm 1 Kernel-Aware Sparse Pattern Search; Algorithm 2 Vertical-Slash Head; Algorithm 3 Block-Sparse Head |
| Open Source Code | Yes | Our code is available at https://aka.ms/MInference. |
| Open Datasets | Yes | We use the provided metrics and scripts from the following benchmarks for evaluation. More details about dataset can be found in Appendix C.1. (i) InfiniteBench [ZCH+24]; (ii) RULER [HSK+24]; (iii) Needle In A Haystack [Kam23]; (iv) PG-19 [RPJ+20] |
| Dataset Splits | Yes | Additionally, we use only one sample as our validation set from KV retrieval synthetic data with 30k token inputs, which exhibits strong generalization and stability across different lengths and domains. |
| Hardware Specification | Yes | Results show that MInference speeds up the pre-filling stage by up to 10× for 1M contexts with LLaMA-3-8B on a single A100, reducing latency from 30 minutes to 3 minutes per prompt, while maintaining or improving accuracy. |
| Software Dependencies | No | The paper mentions software like PyTorch, FlashAttention [Dao24], Triton [TKC19], and PIT [ZJZ+23], but does not specify their exact version numbers. |
| Experiment Setup | Yes | We set the target FLOPs t to 1k global tokens and 4k local windows in the A-shape pattern. We set last_q = 64 and block_size = 64 in the Vertical-Slash and Block-Sparse patterns, respectively. The latency experiments are conducted on a single Nvidia A100 GPU in the bfloat16 format. |
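
The Experiment Setup row cites an A-shape pattern with 1k global tokens and 4k local windows. As a rough illustration only, the sketch below builds such a mask and counts how many causal attention entries it keeps. It assumes that "A-shape" means each query attends to the first few tokens plus a recent local window, and that "1k" and "4k" mean 1024 and 4096. The function names `a_shape_mask`, `a_shape_kept_entries`, and `causal_entries` are hypothetical and are not taken from the MInference code base.

```python
def a_shape_mask(seq_len, n_global, n_local):
    """Boolean causal mask (illustrative assumption, not the authors' kernel).

    Query i may attend to key j when j <= i and either j is one of the
    first n_global tokens or j lies within the last n_local positions
    ending at i.
    """
    return [
        [j <= i and (j < n_global or i - j < n_local) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def a_shape_kept_entries(seq_len, n_global, n_local):
    """Closed-form count of kept entries in the mask above.

    Row i keeps min(i + 1, n_global + n_local) entries, so the total is
    a triangular part followed by rows of constant width.
    """
    width = n_global + n_local
    if seq_len <= width:
        return seq_len * (seq_len + 1) // 2
    return width * (width + 1) // 2 + (seq_len - width) * width


def causal_entries(seq_len):
    """Number of entries in a full causal (lower-triangular) mask."""
    return seq_len * (seq_len + 1) // 2


if __name__ == "__main__":
    # Small mask that can be inspected by eye: 'x' = kept, '.' = skipped.
    for row in a_shape_mask(8, 2, 3):
        print("".join("x" if keep else "." for keep in row))

    # Fraction of causal entries kept at the context length named in the
    # table (1M tokens), under the 1024 / 4096 assumption stated above.
    n = 1_000_000
    density = a_shape_kept_entries(n, 1024, 4096) / causal_entries(n)
    print(f"kept fraction of causal attention: {density:.4%}")
```

The closed-form count avoids materializing a mask at 1M tokens; the explicit mask is only meant for checking small cases by eye.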