Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
Authors: Susav Shrestha, Bradley Settlemyer, Nikoli Dryden, Narasimha Reddy
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop Selective Head Attention with hardware-efficient, sparsity-aware GPU kernels, delivering up to 2.2× end-to-end speedups for models like OPT, LLaMA 2 & 3, Qwen, and Mistral across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes, making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems. |
| Researcher Affiliation | Collaboration | Susav Shrestha (Texas A&M University); Brad Settlemyer (NVIDIA); Nikoli Dryden (Lawrence Livermore National Laboratory); Narasimha Reddy (Texas A&M University) |
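| Pseudocode | Yes | Algorithm 1: Selective Head Flash Attention (Decode). Require: $Q \in \mathbb{R}^{B \times H \times 1 \times d}$; $K, V \in \mathbb{R}^{B \times H \times N_{kv} \times d}$; `batch_head_index` $\in \mathbb{Z}^{B \times \text{top\_k}}$; $M_{SRAM}$; $s = 1/\sqrt{d}$. Output: $O \in \mathbb{R}^{B \times H \times 1 \times d}$ (written to HBM). 1: Determine target batch index $b$ and top-k index $k$ assigned to this unit from the grid dimensions. 2: `head_idx` $\leftarrow$ `batch_head_index[b, k]` (get the actual head index to compute). 3: $B_c = \lceil M_{SRAM}/(4d) \rceil$; $(O_{acc}, l_{acc}, m_{acc}) \leftarrow (\mathbf{0}, 0, -\infty)$; $T_c = \lceil N_{kv}/B_c \rceil$. 4: Load $q \in \mathbb{R}^{1 \times d}$ from `Q[b, head_idx, 0, :]` (get the activated query vector for the batch). 5: for $j = 1$ to $T_c$ do. 6: $k_{start} = (j-1)B_c$, $k_{end} = \min(j B_c, N_{kv})$; load $K_j, V_j$ from `K, V[b, head_idx, k_start : k_end, :]`. 7: $S_j = s\,(q K_j^{T})$; $m_j = \max(S_j)$; $P_j = \exp(S_j - m_j)$; $l_j = \sum P_j$. 8: $m_{new} = \max(m_{acc}, m_j)$; $\alpha = e^{m_{acc} - m_{new}}$; $\beta = e^{m_j - m_{new}}$; $l_{new} = \alpha\, l_{acc} + \beta\, l_j$. 9: $O_{acc} \leftarrow (\alpha\, l_{acc}\, O_{acc} + \beta\,(P_j V_j))/l_{new}$; $l_{acc} \leftarrow l_{new}$; $m_{acc} \leftarrow m_{new}$. 10: end for. 11: Write $O_{acc}$ to `O[b, head_idx, 0, :]`. |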
| Open Source Code | Yes | Our code is available at: https://github.com/susavlsh10/Polar-Sparsity. |
| Open Datasets | Yes | To train the routers, we collected 400,000 token samples from random text sequences extracted from the Wikitext-2 training dataset. |
| Dataset Splits | Yes | To train the routers, we collected 400,000 token samples from random text sequences extracted from the Wikitext-2 training dataset. |
| Hardware Specification | Yes | All experiments are performed on NVIDIA DGX A100 80GB GPU node servers. |
| Software Dependencies | No | We built on top of the FlashAttention Triton kernel and utilized CUDA graphs to measure the decoding throughput with the included routers. |
| Experiment Setup | Yes | We use a batch size of 64, a learning rate of 1e-4, and early stopping over a maximum of 20 epochs. The LLM parameters are frozen during router training. Supervision data is collected from inference runs on the WikiText-2 dataset, as described in Section 5. To determine minimal top-k values for the MLP layers, we apply a simple greedy algorithm (Algorithm 2) that incrementally adjusts the threshold to meet the target recall of 99%. |
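
The Algorithm 1 pseudocode quoted in the Pseudocode row can be illustrated with a small NumPy reference. This is a minimal sketch, not the authors' Triton kernel: the function names `selective_head_decode` and `dense_reference`, the `block_size` parameter (standing in for the SRAM-derived block size), and the choice to leave non-selected heads as zeros are all assumptions made for illustration. The sequential loops stand in for the GPU grid, where each (batch, top-k) pair would be handled by one parallel unit.

```python
import numpy as np

def selective_head_decode(Q, K, V, batch_head_index, block_size=64):
    """Single-token decode attention computed only for selected heads.

    Q: (B, H, 1, d); K, V: (B, H, N_kv, d); batch_head_index: (B, top_k) ints.
    Returns O of shape (B, H, 1, d). Heads that are not selected stay zero
    (an assumption of this sketch, not stated in the quoted algorithm).
    """
    B, H, _, d = Q.shape
    n_kv = K.shape[2]
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    for b in range(B):                       # one "unit" per (b, k) pair
        for head_idx in batch_head_index[b]:
            q = Q[b, head_idx, 0]            # activated query vector, shape (d,)
            o_acc = np.zeros(d)
            l_acc, m_acc = 0.0, -np.inf
            for k_start in range(0, n_kv, block_size):
                k_end = min(k_start + block_size, n_kv)
                Kj = K[b, head_idx, k_start:k_end]
                Vj = V[b, head_idx, k_start:k_end]
                s = scale * (Kj @ q)         # scores for this KV block
                m_j = s.max()
                p = np.exp(s - m_j)
                l_j = p.sum()
                # Online-softmax rescaling (steps 8 and 9 of the pseudocode).
                m_new = max(m_acc, m_j)
                alpha = np.exp(m_acc - m_new)  # 0.0 on the first block
                beta = np.exp(m_j - m_new)
                l_new = alpha * l_acc + beta * l_j
                o_acc = (alpha * l_acc * o_acc + beta * (p @ Vj)) / l_new
                l_acc, m_acc = l_new, m_new
            O[b, head_idx, 0] = o_acc
    return O

def dense_reference(Q, K, V):
    """Standard softmax attention over all heads, for comparison."""
    d = Q.shape[-1]
    s = (Q @ K.transpose(0, 1, 3, 2)) / np.sqrt(d)
    s = s - s.max(axis=-1, keepdims=True)
    p = np.exp(s)
    p = p / p.sum(axis=-1, keepdims=True)
    return p @ V
```

For the selected heads, the blockwise result should match `dense_reference` up to floating-point error, which is a quick way to check that the running-max rescaling is applied correctly.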