ZipLM: Inference-Aware Structured Pruning of Language Models
Authors: Eldar Kurtić, Elias Frantar, Dan Alistarh
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we progress towards resolving this problem by proposing a novel structured compression approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art accuracy-vs-speedup, while matching a set of desired target runtime speedups in any given inference environment. ... When compressing GPT2, ZipLM outperforms DistilGPT2 while being 60% smaller and 30% faster. Our code is available at: https://github.com/IST-DASLab/ZipLM. 4 Experiments. Setup. Given a pre-trained model, a dataset, and a set of desired speedups in a target inference environment, we iteratively fine-tune and prune the model in a structured way such that in the end we obtain a set of accurate compressed models, one for each speedup target. We consider pruning of the standard BERT-base and BERT-large architectures, evaluating on dev-sets of established benchmarks: SQuADv1.1 [42], and a subset of GLUE [54] tasks... |
| Researcher Affiliation | Collaboration | Eldar Kurtic, IST Austria (eldar.kurtic@ist.ac.at); Elias Frantar, IST Austria (elias.frantar@ist.ac.at); Dan Alistarh, IST Austria & Neural Magic (dan.alistarh@ist.ac.at) |
| Pseudocode | Yes | Algorithm 1: The ZipLM pruning algorithm. Given the inverse Hessian H^-1 = (2XX^T + λI)^-1, we remove exactly k structures from the corresponding weight matrix W. (A hedged code sketch of this structured-pruning step is given after the table.) |
| Open Source Code | Yes | Our code is available at: https://github.com/IST-DASLab/ZipLM. |
| Open Datasets | Yes | We consider pruning of the standard BERT-base and BERT-large architectures, evaluating on dev-sets of established benchmarks: SQuADv1.1 [42], and a subset of GLUE [54] tasks: SST-2 [48], QNLI [54], MNLI [57], and QQP [46], selected to match publicly-available checkpoints from prior work. ... we also consider pruning of the decoder-based GPT2 model on the OpenWebText Corpus [10], for which we consider two inference environments: pruning for throughput (batch-size=16, sequence-length=1024), and pruning for latency (batch-size=1, a set of prompts with varying lengths). For illustration, our pipeline is depicted in Figure 1. In Appendix H and I, we report exact values for all results, as well as hyper-parameters for reproducibility. ... on the test-split of the WikiText [37] dataset. |
| Dataset Splits | Yes | We consider pruning of the standard BERT-base and BERT-large architectures, evaluating on dev-sets of established benchmarks: SQuADv1.1 [42], and a subset of GLUE [54] tasks: SST-2 [48], QNLI [54], MNLI [57], and QQP [46], selected to match publicly-available checkpoints from prior work. For a fair comparison, we follow the DistilGPT2 setup [44] and prune the 124M-parameter GPT2 variant on the OpenWebText Corpus dataset, followed by zero-shot evaluations, without any finetuning, on the test-split of the WikiText [37] dataset. Evaluating and comparing compressed models on the development set (dev-set) is standard practice, as it enables comparisons with off-the-shelf results from the literature. |
| Hardware Specification | Yes | For a precise comparison to prior work [59], our inference environment is a single NVIDIA V100 16GB GPU, batch size of 128, and sequence lengths of 384 and 128 for SQuAD and GLUE tasks, respectively. In terms of end-to-end runtime, ZipLM produces the entire family of compressed BERT-base models on a single RTX A6000 GPU in 35 hours on larger datasets (e.g. MNLI) and only 10 hours on smaller ones (e.g. SST-2). We benchmark these compound compressed models by running inference in the DeepSparse [39] engine, on a single core of an Intel Cascade Lake CPU. |
| Software Dependencies | No | The paper mentions software like the 'PyTorch-HuggingFace framework', the 'Transformers library [58]', and 'SparseML [25]' but does not provide specific version numbers for these software dependencies, which are required for reproducible setup. |
| Experiment Setup | Yes | In Table 10 we report hyper-parameters used to produce our ZipLM-pruned models in Section 4. Table 10 includes 'batch-size', 'max-seq-length', 'finetune before pruning', 'finetune in-between pruning steps', 'LR schedule in-between pruning steps', 'initial LR', '#calibration samples', 'speedup-targets', 'knowledge distillation λ1', 'knowledge distillation λ2', 'knowledge distillation λ3', and 'weight-decay'. (These fields are collected into a hypothetical config sketch after the table.) |
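
The Pseudocode row pairs the inverse Hessian H^-1 = (2XX^T + λI)^-1 with the removal of exactly k weight structures from a weight matrix W. Below is a minimal sketch of a generic structured, Optimal-Brain-Surgeon-style pruning step that is consistent with that quoted setup; it is not taken from the ZipLM repository, and the function names (`inverse_hessian`, `structure_saliency`, `prune_structure`) as well as the damping default are illustrative assumptions.

```python
# Hedged sketch of an OBS-style structured pruning step, assuming:
#   W has shape (d_out, d_in), X holds calibration inputs of shape (d_in, n_samples),
#   and `idx` indexes the columns of W that form one prunable structure.
import torch


def inverse_hessian(X: torch.Tensor, damp: float = 1e-2) -> torch.Tensor:
    """H^-1 = (2 X X^T + lambda I)^-1 for the layer-wise quadratic loss."""
    d = X.shape[0]
    H = 2 * X @ X.T + damp * torch.eye(d, dtype=X.dtype, device=X.device)
    return torch.linalg.inv(H)


def structure_saliency(W: torch.Tensor, H_inv: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Increase in layer-wise L2 error caused by zeroing the columns in `idx`."""
    W_q = W[:, idx]                                   # (d_out, |idx|)
    H_qq_inv = torch.linalg.inv(H_inv[idx][:, idx])   # ((H^-1)_QQ)^-1
    # trace(W_Q ((H^-1)_QQ)^-1 W_Q^T), summed over output rows
    return torch.einsum("ij,jk,ik->", W_q, H_qq_inv, W_q)


def prune_structure(W: torch.Tensor, H_inv: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Zero the columns in `idx` and compensate the remaining weights (OBS update)."""
    W_q = W[:, idx]
    H_qq_inv = torch.linalg.inv(H_inv[idx][:, idx])
    W = W - W_q @ H_qq_inv @ H_inv[idx, :]            # redistribute the removed weights
    W[:, idx] = 0.0
    return W
```

Removing exactly k structures, as in the quoted algorithm, would repeat the saliency ranking and the update above k times (or pick the k cheapest structures in one pass), with the index sets corresponding to, e.g., attention heads or FFN intermediate neurons.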
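
For the Experiment Setup row, the hyper-parameter names quoted from Table 10 can be gathered into one configuration object. The skeleton below is hypothetical: only the field names come from the quoted row (plus the sequence lengths quoted in the Hardware row); every value is a placeholder to be filled from Table 10 of the paper.

```python
# Hypothetical config skeleton; field names mirror the Table 10 entries quoted above,
# values are deliberately left unset.
ziplm_run_config = {
    "batch_size": None,                       # per-task value from Table 10
    "max_seq_length": None,                   # 384 for SQuAD, 128 for GLUE (quoted Hardware row)
    "finetune_before_pruning": None,
    "finetune_between_pruning_steps": None,
    "lr_schedule_between_pruning_steps": None,
    "initial_lr": None,
    "num_calibration_samples": None,
    "speedup_targets": None,                  # list of target inference speedups
    "kd_lambda1": None,                       # knowledge-distillation loss weights
    "kd_lambda2": None,
    "kd_lambda3": None,
    "weight_decay": None,
}
```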