The Need for Speed: Pruning Transformers with One Recipe

Authors: Samir Khaki, Konstantinos N. Plataniotis

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | ...produce state-of-the-art results on natural language, image classification, transfer learning, and semantic segmentation tasks. Our motivation stems from the need for a generalizable model compression framework that scales well across different transformer architectures and applications. Given a FLOP constraint, the OPTIN framework will compress the network while maintaining competitive accuracy performance and improved throughput. Particularly, we show a 2% accuracy degradation from NLP baselines and a 0.5% improvement from state-of-the-art methods on image classification at competitive FLOPs reductions. We further demonstrate the generalization of tasks and architecture with comparative performance on Mask2Former for semantic segmentation and CNN-style networks.
Researcher Affiliation | Academia | Samir Khaki, Konstantinos N. Plataniotis, Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada. samir.khaki@mail.utoronto.ca
Pseudocode | Yes | Algorithm 1: OPTIN Framework for Model Compression (an illustrative pruning sketch follows the table).
Open Source Code | Yes | Code is available at: https://github.com/Skhaki18/optin-transformer-pruning.
Open Datasets | Yes | For Natural Language Processing, OPTIN is evaluated on the GLUE Benchmark (Wang et al., 2019)... For Image Classification, both ImageNet-1K (Deng et al., 2009) and CIFAR-10 (Krizhevsky et al., 2009)... For Semantic Segmentation, the Cityscapes Dataset (Cordts et al., 2016)... (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper mentions training and validation images/data for some datasets (e.g., ImageNet-1K, CIFAR-10) and refers to validation error, but does not explicitly state the specific train/validation/test splits (e.g., percentages or exact counts for all splits) needed to reproduce the experiment's data partitioning.
Hardware Specification | Yes | All time measurements are captured over 300 iterations on an Nvidia RTX 2080 using a 100-iteration warmup. (A timing sketch follows the table.)
Software Dependencies | No | We implement our method using transformers from the Hugging Face Library (Wolf et al., 2020) and infrastructure from PyTorch (Paszke et al., 2019). The paper names the software used but does not provide specific version numbers for reproducibility. (An environment-logging sketch follows the table.)
Experiment Setup | Yes | All time measurements are captured over 300 iterations on an Nvidia RTX 2080 using a 100-iteration warmup. The amount (batch) of data used to compute the scores is ablated in Appendix A.6. In Tab. 1d, we use the λ sweep to express relative magnitude differences between L_MD and L_KD.
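
Illustrative pruning sketch (referenced from the Pseudocode row). The paper's Algorithm 1 is not reproduced here; this is only a minimal sketch of the interface the abstract describes, i.e. a FLOP budget in and a compressed network out, using a generic "rank components, then drop the least important until the budget is met" loop. The `PrunableUnit` structure, the random importance scores, and the per-unit FLOP costs are hypothetical placeholders, not the OPTIN saliency metric.

```python
# Hypothetical sketch of one-shot, FLOP-constrained pruning: rank prunable
# components by importance and drop the weakest until the budget is met.
# Importance values below are random placeholders, NOT the paper's metric.
from dataclasses import dataclass
import random

@dataclass
class PrunableUnit:
    name: str          # e.g. "layer3.head2" (hypothetical naming)
    flops: float       # FLOPs this unit contributes per forward pass
    importance: float  # saliency score (placeholder values below)

def prune_to_flop_budget(units, flop_budget):
    """Greedily remove the least important units until total FLOPs fit the budget."""
    kept = sorted(units, key=lambda u: u.importance, reverse=True)
    total = sum(u.flops for u in kept)
    removed = []
    while total > flop_budget and kept:
        victim = kept.pop()          # lowest-importance unit
        total -= victim.flops
        removed.append(victim.name)
    return kept, removed

if __name__ == "__main__":
    random.seed(0)
    units = [PrunableUnit(f"layer{l}.head{h}", flops=1e6, importance=random.random())
             for l in range(12) for h in range(12)]
    kept, removed = prune_to_flop_budget(units, flop_budget=0.6 * 144 * 1e6)
    print(f"kept {len(kept)} heads, removed {len(removed)} (~60% FLOP budget)")
```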
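
Dataset-loading sketch (referenced from the Open Datasets row). This is not the authors' data pipeline; it only shows one common way to obtain the public benchmarks named in the paper, assuming the Hugging Face `datasets` library and `torchvision`. ImageNet-1K and Cityscapes require manual, registered downloads, so the local paths below are placeholders.

```python
# One common way to load the benchmarks named in the paper (not the authors' pipeline).
from datasets import load_dataset          # pip install datasets
from torchvision import datasets, transforms

# GLUE (example task: SST-2) via the Hugging Face hub.
sst2 = load_dataset("glue", "sst2")

# CIFAR-10 via torchvision (downloads automatically on first use).
cifar_train = datasets.CIFAR10(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())

# ImageNet-1K and Cityscapes assume pre-downloaded local copies (placeholder paths).
imagenet_val = datasets.ImageNet(root="/path/to/imagenet", split="val")
cityscapes_val = datasets.Cityscapes(root="/path/to/cityscapes", split="val",
                                     mode="fine", target_type="semantic")
```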
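
Timing sketch (referenced from the Hardware Specification row). This follows the protocol quoted above, 100 warmup iterations followed by 300 timed iterations with CUDA synchronization, but the network and input batch are placeholders rather than the paper's pruned transformers.

```python
# Sketch of the latency/throughput protocol quoted above: 100 warmup iterations,
# then 300 timed iterations. Model and input shape are placeholders.
import time
import torch
import torchvision

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.resnet50().to(device).eval()   # placeholder network
x = torch.randn(32, 3, 224, 224, device=device)           # placeholder batch

with torch.no_grad():
    for _ in range(100):                                   # warmup
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(300):                                   # timed iterations
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"mean latency: {elapsed / 300 * 1e3:.2f} ms/iter, "
      f"throughput: {300 * x.shape[0] / elapsed:.1f} img/s")
```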
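
Environment-logging sketch (referenced from the Software Dependencies row). Since the paper names PyTorch and Hugging Face Transformers but not their versions, a small script like the following can record the versions actually installed when re-running the released code.

```python
# Record installed library and CUDA versions to document the environment,
# since the paper does not report specific version numbers.
import torch
import torchvision
import transformers

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)
print("CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```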