A Fast Post-Training Pruning Framework for Transformers
Authors: Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our method to BERT_BASE and DistilBERT, and we evaluate its effectiveness on GLUE and SQuAD benchmarks. Our framework achieves up to 2.0× reduction in FLOPs and 1.56× speedup in inference latency, while maintaining < 1% loss in accuracy. We extensively test our framework by applying it to BERT_BASE and DistilBERT on GLUE and SQuAD tasks (Section 5.2). |
| Researcher Affiliation | Collaboration | Woosuk Kwon UC Berkeley woosuk.kwon@berkeley.edu, Joseph Hassoun Samsung Semiconductor, Inc. j.hassoun@samsung.com |
| Pseudocode | Yes | Algorithm 1 Mask Search with a FLOPs Constraint (a generic, non-verbatim sketch follows the table) |
| Open Source Code | Yes | Our code is publicly available at https://github.com/WoosukKwon/retraining-free-pruning |
| Open Datasets | Yes | We evaluate the effectiveness of our approach using BERT_BASE [12] and DistilBERT [63] on GLUE [78] and SQuAD [60, 61] benchmarks. We use 2K examples from the training sets for pruning, and we evaluate the resulting models on the development sets. (A minimal data-sampling sketch follows the table.) |
| Dataset Splits | Yes | We use 2K examples from the training sets for pruning, and we evaluate the resulting models on the development sets. |
| Hardware Specification | Yes | With batch size of 256, we achieve speedup of 1.47× on average and up to 1.56× on an NVIDIA V100 GPU. For all experiments, we used an AWS p3.2xlarge instance which has 1 NVIDIA V100 GPU. |
| Software Dependencies | No | The paper states that the framework is implemented on 'PyTorch [57] and the Hugging Face Transformers [86] library,' but it does not specify any version numbers for these software dependencies. |
| Experiment Setup | Yes | We use 2K examples from the training sets for pruning, and we evaluate the resulting models on the development sets. All of the results are averaged over the runs with 10 different seeds. Our method has only two hyperparameters which were fixed in all of our experiments (See Section 4.3). ... Concretely, we re-parameterize the least squares problem as $\arg\min_{r_l} \|A r_l + A\mathbf{1} - b\|_2^2$ where $m_l = \mathbf{1} + r_l$, and solve it with the damp value fixed to 1. ... In all of our experiments, we fixed the two hyperparameter values as we mentioned here. (A small numerical sketch of this re-parameterization follows the table.) |
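The re-parameterized least-squares problem quoted in the Experiment Setup row can be checked numerically. Below is a minimal sketch, assuming NumPy/SciPy with synthetic data; the names `A_l` and `b_l` and the choice of SciPy's LSMR solver (which exposes a `damp` argument) are illustrative assumptions, not the paper's implementation. Only the re-parameterization $m_l = \mathbf{1} + r_l$ and the damp value of 1 come from the quoted text.

```python
# Minimal sketch of the quoted re-parameterization (assumed names and solver):
#   argmin_{r_l} || A r_l + A*1 - b ||_2^2,  with m_l = 1 + r_l and damp = 1.
import numpy as np
from scipy.sparse.linalg import lsmr

def tune_layer_mask(A_l: np.ndarray, b_l: np.ndarray, damp: float = 1.0) -> np.ndarray:
    """Solve the damped least-squares problem for r_l and return m_l = 1 + r_l."""
    ones = np.ones(A_l.shape[1])
    # Move the constant term A_l @ 1 to the right-hand side, then solve for r_l.
    r_l = lsmr(A_l, b_l - A_l @ ones, damp=damp)[0]
    return ones + r_l

# Toy usage with synthetic data standing in for per-example activations.
rng = np.random.default_rng(0)
A_l = rng.standard_normal((2000, 12))        # e.g. 2K examples x 12 heads (assumed shapes)
b_l = A_l @ rng.uniform(0.5, 1.0, size=12)   # synthetic target outputs
print(tune_layer_mask(A_l, b_l).round(3))
```

The damping term keeps $r_l$ small, i.e. it keeps the tuned mask $m_l$ close to 1, which is consistent with the quote's statement that the damp value was fixed to 1 in all experiments.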
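For the dataset rows, the paper states only that 2K examples are drawn from the training sets for pruning and that the development sets are used for evaluation. A minimal sketch of that split, assuming the Hugging Face `datasets` library, SST-2 as a stand-in GLUE task, and an arbitrary seed (all assumptions made for illustration):

```python
# Sketch of drawing 2K pruning examples from a GLUE training split and keeping
# the development ("validation") split for evaluation. The `datasets` library,
# the SST-2 task, and the seed are assumptions, not details from the paper.
from datasets import load_dataset

task = "sst2"  # stand-in for any GLUE task covered by the paper
train = load_dataset("glue", task, split="train")
dev = load_dataset("glue", task, split="validation")

pruning_set = train.shuffle(seed=0).select(range(2000))  # 2K pruning examples
print(len(pruning_set), len(dev))
```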
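The Pseudocode row cites Algorithm 1, "Mask Search with a FLOPs Constraint," whose exact procedure is given only in the paper. The snippet below is not that algorithm; it is a generic greedy importance-per-FLOP selection, included solely to illustrate the shape of a FLOPs-constrained mask search. The names, scoring rule, and greedy strategy are all assumptions.

```python
# Illustrative greedy selection under a FLOPs budget (NOT the paper's Algorithm 1):
# rank prunable units by importance per FLOP and keep them while the budget allows.
from typing import Dict, List

def greedy_mask_search(importance: Dict[str, float],
                       flops_cost: Dict[str, float],
                       flops_budget: float) -> List[str]:
    """Return the units to keep; everything else would be masked out."""
    ranked = sorted(importance, key=lambda u: importance[u] / flops_cost[u], reverse=True)
    kept, used = [], 0.0
    for unit in ranked:
        if used + flops_cost[unit] <= flops_budget:
            kept.append(unit)
            used += flops_cost[unit]
    return kept

# Toy usage: four hypothetical attention heads under a 50% FLOPs budget.
importance = {"h0": 0.9, "h1": 0.2, "h2": 0.7, "h3": 0.1}
flops_cost = {h: 1.0 for h in importance}
print(greedy_mask_search(importance, flops_cost, flops_budget=2.0))  # ['h0', 'h2']
```

A greedy ratio rule like this is a common baseline for budgeted selection; the paper's actual search procedure may differ substantially.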