Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks

Authors: Xiang Meng, Mehdi Makni, Rahul Mazumder

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments show that LLa MA3.2-8B with transposable 16:32 sparsity maintains performance close to its standard N:M counterpart and outperforms standard 2:4 sparse model, showing the practical value of our approach. Our code is available at https://github.com/mazumder-lab/TSENOR. We evaluate our proposed transposable binary mask solver against several approaches: (i) Network Flow method [Hubara et al., 2021b], which guarantees optimal solutions through bipartite matching algorithms; (ii) cu PDLP [Lu and Yang, 2023], a general-purpose GPU-accelerated linear programming solver; (iii) 2-Approximation [Hubara et al., 2021b], a greedy-based heuristic; (iv) Bi-NM adopted from [Zhang et al., 2023], which sequentially applies row-wise and then column-wise N:M sparsity; and (v) Max1000, a randomized baseline that generates 1000 feasible masks and selects the best one. Our experiments focus on transposable N:M sparsity with M 8, as smaller patterns (M=4) can already be optimally and efficiently solved [Hu et al., 2024].
Researcher Affiliation Academia Xiang Meng Operations Research Center Massachusetts Institute of Technology EMAIL Mehdi Makni Operations Research Center Massachusetts Institute of Technology EMAIL Rahul Mazumder Operations Research Center Massachusetts Institute of Technology EMAIL
Pseudocode Yes Algorithm 1 Dykstra s algorithm for solving entropy-regularized optimal transport problem Input: Weight matrix W, regularization parameter τ > 0, and maximum iterations T Output: Solution S to problem (4) 1: Initialize S(0) = exp(τ|W|), and dual variable Q(0) = 1M M 2: for t = 0, 1, . . . , T 1 do 3: S(t) Diag N/(S(t)1M) S(t) Projection onto C1 4: S(t) S(t)Diag N/(S(t) 1M) Projection onto C2 5: S(t+1) min S(t) Q(t), 1 Projection onto C3 6: Q(t+1) Q(t) (S(t) S(t+1)) Update dual variable 7: return S(T ) Algorithm 2 A binary mask generation approach based on greedy selection and local search Input: Approximate solution Sa from Algorithm 1 and number of local search step L. Output: Feasible binary solution to problem (1) 1: Initialize S = 0M M, row counter R = 0M, column counter C = 0M 2: {(it, jt)}M 2 t=1 argsort({ Sa ij}M i,j=1) Sort elements in descending order 3: for t = 1, 2, . . . , M 2 do greedy selection 4: if Rit < N and Cjt < N then 5: Set Sit,jt 1 6: Update (Rit, Cjt) (Rit + 1, Cjt + 1) 7: for t = 1, 2, . . . , L do L local search steps 8: if Ri = N, Ci = N, i [M] then 9: break 10: Pick i, j [M] such that Ri < N, Cj < N. 11: Select (i , j ) = arg maxi ,j Swap(i , j ) as defined in Eq.(6). 12: if Swap(i , j ) > 0 then 13: Set (Si ,j , Si ,j, Si,j ) (0, 1, 1), (Ri, Cj) (Ri + 1, Cj + 1). 14: return Feasible binary mask S
Open Source Code Yes Our code is available at https://github.com/mazumder-lab/TSENOR.
Open Datasets Yes Perplexity is computed following Hugging Face s methodology [Per, 2022] with full stride on raw-Wiki Text2 [Merity et al., 2017], PTB [Marcus et al., 1994], and C4 [Raffel et al., 2020] validation subset. Zero-shot evaluation uses LM Harness [Gao et al.] on PIQA [Bisk et al., 2020], ARC-E and ARC-C [Clark et al., 2018], Hellaswag [Zellers et al., 2019], Winogrande [Sakaguchi et al., 2021], RTE [Poliak, 2020], Openbook QA [Banerjee et al., 2019], and Bool Q [Clark et al., 2019].
Dataset Splits Yes Perplexity is computed following Hugging Face s methodology [Per, 2022] with full stride on raw-Wiki Text2 [Merity et al., 2017], PTB [Marcus et al., 1994], and C4 [Raffel et al., 2020] validation subset. Zero-shot evaluation uses LM Harness [Gao et al.] on PIQA [Bisk et al., 2020], ARC-E and ARC-C [Clark et al., 2018], Hellaswag [Zellers et al., 2019], Winogrande [Sakaguchi et al., 2021], RTE [Poliak, 2020], Openbook QA [Banerjee et al., 2019], and Bool Q [Clark et al., 2019]. Training is conducted on the first shard of the C4 training dataset, which contains over 150 million tokens.
Hardware Specification Yes Unless otherwise specified, we utilized an Intel Xeon Gold 6248 machine with 20 CPU cores and a single NVIDIA A100 GPU, featuring 192GB of system RAM and 40GB of GPU memory. Table 1: Runtime (seconds) for transposable 8:16 sparsity. For GPU methods, we test on NVIDIA V100-PCIe-32GB, A100-PCIe-40GB, and H100-PCIe-80GB. CPU methods use 16-core parallel processing.
Software Dependencies No All language models and pruning methods were implemented using the Py Torch library Paszke et al. [2017]. We employ the official Julia implementation from [Lu and Yang, 2023] (accessible via Git Hub).
Experiment Setup Yes For Algorithm 1, we set the regularization parameter τ to 0.005 maxij |Wij| and limit the maximum iterations to T = 300. In Algorithm 2, we perform L = 10 local search steps to refine the solution. For both finetuning and Bi-NM retraining, we use a learning rate of 2e-5 and a batch size of size 64 per step. The block size used is 1024 tokens per batch. The effective batch size is obtained by using a physical batch size of 2 on GPU with 32 gradient accumulation steps before each weight update. Training is conducted on the first shard of the C4 training dataset, which contains over 150 million tokens. We employ the Adam optimizer with Py Torch s default hyperparameters. A cosine learning rate scheduler is used, with a warmup ratio of 0.03 and no weight decay applied.