MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models

Authors: Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, Xinchao Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work introduces MaskLLM, a learnable pruning method that establishes semi-structured (or N:M) sparsity in LLMs, aimed at reducing computational overhead during inference. ... We assessed MaskLLM using 2:4 sparsity on various LLMs, including LLaMA-2, Nemotron-4, and GPT-3, with sizes ranging from 843M to 15B parameters, and our empirical results show substantial improvements over state-of-the-art methods. (A brief illustration of the 2:4 pattern appears after this table.)
Researcher Affiliation | Collaboration | NVIDIA; National University of Singapore
Pseudocode | Yes | The learning process of MaskLLM is straightforward. We begin with randomly initialized logits and update them with the prior masks as in Equation 10, if available. Then we optimize the logits to solve the objective in Equation 8. The mask Mi with the largest logit is taken as the final mask for inference. This process is summarized in Algorithm 1. (A minimal sketch of this mask-learning loop appears after this table.)
Open Source Code | Yes | Code is available at https://github.com/NVlabs/MaskLLM.
Open Datasets | Yes | For LLaMA-2 and Nemotron-4, we collected a blended training set following the original papers [36, 31] for training. For the GPT-3 multilingual models, we used the original training set for mask learning. For evaluation, we follow SparseGPT [12] in using the C4 dataset [34] for one-shot pruning and Wikitext [28] for evaluation. (A data-loading sketch appears after this table.)
Dataset Splits | No | The paper mentions training data, evaluation data (Wikitext), and calibration data (C4) for one-shot methods, but it does not specify a distinct validation split for its own MaskLLM training process.
Hardware Specification | Yes | We used 64 A100 GPUs during training with an 8-way tensor parallel configuration... In Table 16, we benchmark the throughput of LLaMA-2 7B with 2:4 sparsity on an A6000 GPU using TensorRT-LLM for a batch size of 1.
Software Dependencies | No | The paper mentions "TensorRT-LLM" in Appendix H but does not specify its version or any other software dependencies with explicit version numbers.
Experiment Setup | Yes | We summarize the hyper-parameters used in our experiments in Table 7. The main results of hyper-parameter tuning are available in Table 10, where we assessed different temperatures, logit scaling factors, and prior strengths with GPT-3 843M. (A hypothetical configuration sketch appears after this table.)
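
For context on the 2:4 pattern referenced in the Research Type row: the snippet below is a generic, one-shot magnitude-based illustration, not the paper's learnable MaskLLM method. It keeps the two largest-magnitude weights in every contiguous group of four, which is the N:M layout that sparse tensor cores accelerate.

```python
# Generic illustration of the 2:4 semi-structured pattern (one-shot magnitude
# pruning, NOT the paper's learnable MaskLLM masks): in every contiguous group
# of 4 weights along the input dimension, keep the 2 largest-magnitude entries.
import torch

def magnitude_2_4_mask(weight: torch.Tensor) -> torch.Tensor:
    """Binary mask with exactly 2 nonzeros per group of 4 along the last dim."""
    rows, cols = weight.shape
    groups = weight.abs().reshape(rows, cols // 4, 4)
    keep = groups.topk(k=2, dim=-1).indices          # 2 largest per group of 4
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)
    return mask.reshape(rows, cols)

w = torch.randn(8, 16)
sparse_w = w * magnitude_2_4_mask(w)                 # 50% sparse, 2:4 structured
```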
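
The Pseudocode row describes learning per-group logits and keeping the candidate mask with the largest logit at inference. The sketch below mirrors that idea under stated assumptions: it enumerates the C(4,2) = 6 candidate 2:4 masks per group of four weights, relaxes the selection with a Gumbel-softmax so the logits can be trained end to end, and hardens to the argmax candidate afterwards. The toy objective, temperature, and optimizer settings are placeholders, not the paper's Equation 8 / Equation 10 or Table 7 values.

```python
# Sketch of learnable 2:4 mask selection under stated assumptions: a Gumbel-softmax
# relaxation over the 6 candidate masks per group of 4 weights. The loss, temperature,
# and optimizer below are toy placeholders, not the paper's Eq. 8 / Eq. 10 settings.
import itertools
import torch
import torch.nn.functional as F

# All C(4,2) = 6 binary masks that keep exactly 2 of 4 weights.
CANDIDATES = torch.tensor(
    [[1.0 if i in keep else 0.0 for i in range(4)]
     for keep in itertools.combinations(range(4), 2)]
)  # shape (6, 4)

def soft_mask(logits: torch.Tensor, tau: float = 4.0) -> torch.Tensor:
    """Differentiable mask: a Gumbel-softmax mixture over the candidates."""
    probs = F.gumbel_softmax(logits, tau=tau, hard=False)   # (groups, 6)
    return probs @ CANDIDATES                                # (groups, 4)

def final_mask(logits: torch.Tensor) -> torch.Tensor:
    """Inference-time mask: the candidate with the largest logit per group."""
    return CANDIDATES[logits.argmax(dim=-1)]

# Toy example: learn masks for a 4x8 weight so the sparse layer mimics the dense one.
weight, x = torch.randn(4, 8), torch.randn(64, 8)
target = x @ weight.t()
logits = torch.randn(weight.numel() // 4, 6, requires_grad=True)  # one row per group
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(300):
    mask = soft_mask(logits).reshape(weight.shape)
    loss = F.mse_loss(x @ (weight * mask).t(), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

deploy_mask = final_mask(logits).reshape(weight.shape)  # exact 2:4 mask for inference
```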
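
For the calibration and evaluation corpora named in the Open Datasets row, a minimal loading sketch is shown below, assuming the Hugging Face `datasets` library. The blended LLaMA-2 / Nemotron-4 training mixtures and the GPT-3 training set are not public recipes and are not reproduced; the count of 128 calibration sequences is a placeholder.

```python
# Minimal sketch of fetching the public corpora named above, assuming the
# Hugging Face `datasets` library. The calibration sample count is a placeholder.
from datasets import load_dataset

# C4 (English): commonly used as calibration data for one-shot pruning (e.g. SparseGPT).
c4_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_texts = [row["text"] for _, row in zip(range(128), c4_stream)]

# WikiText: the perplexity evaluation corpus.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
eval_text = "\n\n".join(wikitext["text"])
```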
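
The actual hyper-parameter values live in Table 7 and Table 10 of the paper and are not reproduced here; the dataclass below is only a hypothetical container naming the kinds of knobs the Experiment Setup row mentions (temperature, logit scaling factor, prior strength), with placeholder defaults.

```python
# Hypothetical hyper-parameter container. Field names echo the knobs mentioned in
# the row above; every default value is a placeholder, NOT the paper's Table 7 setting.
from dataclasses import dataclass

@dataclass
class MaskLearningConfig:
    gumbel_temperature: float = 1.0   # placeholder; the paper sweeps this in Table 10
    logit_scale: float = 1.0          # placeholder logit scaling factor
    prior_strength: float = 1.0       # placeholder weight on the prior mask
    learning_rate: float = 1e-4       # placeholder optimizer setting

cfg = MaskLearningConfig()
```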