Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Simple and Effective Pruning Approach for Large Language Models
Authors: Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a thorough evaluation of our method Wanda on LLaMA and LLaMA-2 across various language benchmarks. We empirically evaluate Wanda on the widely adopted LLaMA (Touvron et al., 2023a) and LLaMA-2 (Touvron et al., 2023b) model families. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University 2Meta AI Research 3Bosch Center for AI |
| Pseudocode | Yes | Algorithm 1 PyTorch code for Wanda |
| Open Source Code | Yes | Code is available at https://github.com/locuslab/wanda. |
| Open Datasets | Yes | To control this variable factor, we use the exact same set of calibration data as SparseGPT, which consists of 128 sequences with context length size sampled from C4 training set (Raffel et al., 2020). |
| Dataset Splits | Yes | we evaluate the perplexity on the held-out WikiText (Merity et al., 2016) validation set. |
| Hardware Specification | Yes | Specifically, we measure the accumulated time for computing the pruning metric at each layer (excluding the forward pass process shared by both methods) on NVIDIA A6000 GPUs. We evaluate the inference speedup for structured 2:4 sparsity on NVIDIA A6000 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch code' in Algorithm 1, but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | For all pruning methods, we focus on pruning the linear layers (skipping the first embedding layer and the final classification head), which account for around 99% of the total LLM parameters. We impose a uniform sparsity for all linear layers. We use the exact same set of calibration data as SparseGPT, which consists of 128 sequences with context length size sampled from C4 training set (Raffel et al., 2020). We investigate two strategies for fine-tuning LLMs: LoRA (Hu et al., 2021) fine-tuning and full parameter dense fine-tuning. Fine-tuning is conducted on C4 training dataset and the objective is the pre-training auto-regressive loss. The pruned mask is kept fixed during fine-tuning. We enforce a limited computational budget (1 GPU and 12 hours). |
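The table's Pseudocode row points to the paper's Algorithm 1 (PyTorch code for Wanda). As a minimal illustrative sketch, not the authors' released implementation, the core idea can be written in plain numpy: score each weight by its magnitude times the per-input-feature activation norm from calibration data, then zero out the lowest-scoring weights within each output row. The function name and the per-row 50% sparsity default below are assumptions for illustration.

```python
import numpy as np

def wanda_prune_layer(W, X, sparsity=0.5):
    """Sketch of Wanda-style pruning for one linear layer.

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration activations.
    Score S_ij = |W_ij| * ||X_j||_2; weights are compared and
    pruned within each output row, as in the paper.
    """
    # L2 norm of each input feature across calibration samples.
    x_norm = np.linalg.norm(X, axis=0)            # (in_features,)
    score = np.abs(W) * x_norm                    # (out, in)
    k = int(W.shape[1] * sparsity)                # weights removed per row
    # Indices of the k lowest-scoring weights in each row.
    prune_idx = np.argsort(score, axis=1)[:, :k]
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, prune_idx, 0.0, axis=1)
    return W_pruned
```

Note that, unlike magnitude pruning, the metric here needs a forward pass over calibration data to obtain `X`, but no weight update or retraining; the returned matrix simply has `k` zeros per output row.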