A Simple and Effective Pruning Approach for Large Language Models
Authors: Mingjie Sun, Zhuang Liu, Anna Bair, J Zico Kolter
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a thorough evaluation of our method Wanda on LLaMA and LLaMA-2 across various language benchmarks. We empirically evaluate Wanda on the widely adopted LLaMA (Touvron et al., 2023a) and LLaMA-2 (Touvron et al., 2023b) model families. |
| Researcher Affiliation | Collaboration | Carnegie Mellon University, Meta AI Research, Bosch Center for AI |
| Pseudocode | Yes | Algorithm 1: PyTorch code for Wanda (a hedged sketch of this metric is given after the table) |
| Open Source Code | Yes | Code is available at https://github.com/locuslab/wanda. |
| Open Datasets | Yes | To control this variable factor, we use the exact same set of calibration data as SparseGPT, which consists of 128 sequences with context length size 2048 sampled from the C4 training set (Raffel et al., 2020). |
| Dataset Splits | Yes | we evaluate the perplexity on the held-out WikiText (Merity et al., 2016) validation set. (see the perplexity sketch after the table) |
| Hardware Specification | Yes | Specifically, we measure the accumulated time for computing the pruning metric at each layer (excluding the forward pass process shared by both methods) on NVIDIA A6000 GPUs. We evaluate the inference speedup for structured 2:4 sparsity on NVIDIA A6000 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch code' in Algorithm 1, but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | For all pruning methods, we focus on pruning the linear layers (skipping the first embedding layer and the final classification head), which account for around 99% of the total LLM parameters. We impose a uniform sparsity for all linear layers. We use the exact same set of calibration data as SparseGPT, which consists of 128 sequences with context length size 2048 sampled from the C4 training set (Raffel et al., 2020). We investigate two strategies for fine-tuning LLMs: LoRA (Hu et al., 2021) fine-tuning and full parameter dense fine-tuning. Fine-tuning is conducted on the C4 training dataset and the objective is the pre-training auto-regressive loss. The pruned mask is kept fixed during fine-tuning. We enforce a limited computational budget (1 GPU and 12 hours). (see the layer-wise pruning sketch after the table) |
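
The Pseudocode row cites Algorithm 1, the paper's PyTorch code for Wanda. Below is a minimal sketch of that per-layer metric as described in the paper: each weight is scored by its magnitude times the L2 norm of the corresponding input feature over the calibration tokens, and the lowest-scoring weights are removed within each output row. The function name, the in-place update, and the 50% default sparsity are illustrative choices, not the authors' exact code.

```python
import torch

def wanda_prune_layer(W: torch.Tensor, X: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Sketch of Wanda-style pruning for one linear layer.

    W: weight matrix of shape (out_features, in_features), modified in place.
    X: calibration activations flattened to (num_tokens, in_features).
    """
    # Score each weight by |W_ij| * ||X_j||_2, where ||X_j||_2 is the norm of input feature j.
    metric = W.abs() * X.norm(p=2, dim=0)

    # Within each output row (Wanda's comparison group), find the smallest scores.
    n_prune = int(W.shape[1] * sparsity)
    _, sorted_idx = torch.sort(metric, dim=1)   # ascending per row
    pruned_idx = sorted_idx[:, :n_prune]

    # Zero out the selected weights; the mask is applied once, with no retraining.
    W.scatter_(1, pruned_idx, 0.0)
    return W
```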
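The Experiment Setup row states that a uniform sparsity is imposed on all linear layers while the embedding layer and the final classification head are skipped. The sketch below shows one way to apply the per-layer function above across a decoder-only model; `prune_all_linear_layers`, `calib_inputs`, and the name-based skip list are assumptions of this sketch, and collecting the calibration activations (e.g. with forward hooks over the 128 C4 sequences) is not shown.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_all_linear_layers(model: nn.Module, calib_inputs: dict, sparsity: float = 0.5) -> None:
    """Apply one uniform sparsity ratio to the linear layers of a decoder-only LM.

    calib_inputs maps a module name to its captured calibration activations,
    flattened to (num_tokens, in_features).
    """
    skip = ("embed", "lm_head")  # leave the embedding and classification head dense
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if any(s in name for s in skip) or name not in calib_inputs:
            continue
        # Prune this layer's weights in place with the same sparsity everywhere.
        wanda_prune_layer(module.weight.data, calib_inputs[name], sparsity)
```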
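The Dataset Splits row notes that perplexity is measured on held-out WikiText. A common way to compute this, sketched below, concatenates the split and scores non-overlapping windows; the dataset config ("wikitext-2-raw-v1"), the validation split, and the 2048-token window are assumptions, since the paper only names WikiText.

```python
import torch
from datasets import load_dataset

@torch.no_grad()
def wikitext_perplexity(model, tokenizer, seq_len: int = 2048, device: str = "cuda") -> float:
    """Perplexity over non-overlapping windows of the concatenated WikiText text."""
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")
    ids = tokenizer("\n\n".join(data["text"]), return_tensors="pt").input_ids.to(device)

    nlls, n_windows = [], ids.shape[1] // seq_len
    for i in range(n_windows):
        chunk = ids[:, i * seq_len : (i + 1) * seq_len]
        out = model(chunk, labels=chunk)          # loss = mean next-token NLL for this window
        nlls.append(out.loss.float() * seq_len)
    return torch.exp(torch.stack(nlls).sum() / (n_windows * seq_len)).item()
```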