A Fast Post-Training Pruning Framework for Transformers

Authors: Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our method to BERTBASE and DistilBERT, and we evaluate its effectiveness on GLUE and SQuAD benchmarks. Our framework achieves up to a 2.0x reduction in FLOPs and a 1.56x speedup in inference latency, while maintaining < 1% loss in accuracy. We extensively test our framework by applying it to BERTBASE and DistilBERT on GLUE and SQuAD tasks (Section 5.2).
Researcher Affiliation | Collaboration | Woosuk Kwon, UC Berkeley, woosuk.kwon@berkeley.edu; Joseph Hassoun, Samsung Semiconductor, Inc., j.hassoun@samsung.com
Pseudocode | Yes | Algorithm 1: Mask Search with a FLOPs Constraint (a generic sketch of such a search appears after this table)
Open Source Code | Yes | Our code is publicly available at https://github.com/WoosukKwon/retraining-free-pruning
Open Datasets | Yes | We evaluate the effectiveness of our approach using BERTBASE [12] and DistilBERT [63] on GLUE [78] and SQuAD [60, 61] benchmarks. We use 2K examples from the training sets for pruning, and we evaluate the resulting models on the development sets.
Dataset Splits | Yes | We use 2K examples from the training sets for pruning, and we evaluate the resulting models on the development sets. (A sketch of this sampling step appears after this table.)
Hardware Specification | Yes | With a batch size of 256, we achieve a speedup of 1.47x on average and up to 1.56x on an NVIDIA V100 GPU. For all experiments, we used an AWS p3.2xlarge instance, which has 1 NVIDIA V100 GPU.
Software Dependencies | No | The paper states that the framework is implemented on "PyTorch [57] and the Hugging Face Transformers [86] library," but it does not specify version numbers for these software dependencies.
Experiment Setup | Yes | We use 2K examples from the training sets for pruning, and we evaluate the resulting models on the development sets. All of the results are averaged over the runs with 10 different seeds. Our method has only two hyperparameters, which were fixed in all of our experiments (see Section 4.3). ... Concretely, we re-parameterize the least squares problem as $\arg\min_{r_l} \|A r_l + A\mathbf{1} - b\|_2^2$ where $m_l = \mathbf{1} + r_l$, and solve it with the damp value fixed to 1. ... In all of our experiments, we fixed the two hyperparameter values as we mentioned here. (A sketch of this damped least-squares step appears after this table.)
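
The 2K-example calibration setup in the Dataset Splits and Experiment Setup entries is straightforward to act on. Below is a minimal sketch using the Hugging Face datasets library; the GLUE task ("mrpc") and the seed are illustrative assumptions, not choices stated by the paper.

```python
# Minimal sketch: draw the 2K pruning/calibration examples described above
# from a GLUE training set, and keep the development set for evaluation.
# The task name ("mrpc") and the seed are assumptions, not the paper's choices.
from datasets import load_dataset

train = load_dataset("glue", "mrpc", split="train")
calib = train.shuffle(seed=0).select(range(2000))        # 2K examples for pruning
dev = load_dataset("glue", "mrpc", split="validation")   # evaluate on the dev set
print(len(calib), len(dev))
```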
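
The Pseudocode entry points to the paper's Algorithm 1 (Mask Search with a FLOPs Constraint). The sketch below is not that algorithm: it is a generic greedy, knapsack-style selection under a FLOPs budget, with the importance scores, FLOPs costs, and all names assumed, meant only to illustrate the kind of constrained mask search the paper formalizes.

```python
# Generic greedy sketch of mask search under a FLOPs budget (NOT the paper's
# Algorithm 1, which uses a Fisher-based objective and a more refined search).
import numpy as np

def greedy_mask_search(importance, flops_cost, flops_budget):
    """Keep units (heads/neurons) with the best importance-per-FLOP ratio
    until the FLOPs budget is exhausted; return a 0/1 mask."""
    mask = np.zeros_like(importance)
    order = np.argsort(-importance / flops_cost)  # best ratio first
    spent = 0.0
    for i in order:
        if spent + flops_cost[i] <= flops_budget:
            mask[i] = 1.0
            spent += flops_cost[i]
    return mask

# Toy usage: 12 attention heads with equal cost, keep ~50% of the FLOPs.
imp = np.random.rand(12)
cost = np.full(12, 1.0)
print(greedy_mask_search(imp, cost, flops_budget=6.0))
```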
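
The Experiment Setup entry states the re-parameterized least squares problem $\arg\min_{r_l} \|A r_l + A\mathbf{1} - b\|_2^2$ with $m_l = \mathbf{1} + r_l$ and a damp value fixed to 1. A damped least-squares solve of exactly this form can be sketched as follows; the use of SciPy's LSMR solver and the matrix shapes are assumptions, since the quoted text only specifies the problem and the damp value.

```python
# Sketch of the damped least-squares step: solve
#   argmin_r ||A r - (b - A @ 1)||_2^2   with damping 1,
# which is equivalent to argmin_r ||A r + A*1 - b||_2^2, then set m = 1 + r.
# The shapes and the LSMR solver are assumptions.
import numpy as np
from scipy.sparse.linalg import lsmr

A = np.random.randn(2000, 64)        # e.g., 2K calibration examples x 64 units
b = np.random.randn(2000)            # layer outputs to reconstruct
ones = np.ones(A.shape[1])

residual = b - A @ ones              # move the m = 1 baseline to the right-hand side
r = lsmr(A, residual, damp=1.0)[0]   # damped least squares, damp fixed to 1
m = ones + r                         # recovered mask values m_l = 1 + r_l
```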