Deep Compression of Pre-trained Transformer Models

Authors: Naigang Wang, Chi-Chun (Charlie) Liu, Swagath Venkataramani, Sanchari Sen, Chia-Yu Chen, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, Leland Chang

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Specifically, we quantize transformer backbones down to 4-bit and further achieve 50% fine-grained structural sparsity on pre-trained BERT, Wav2vec2.0, and Vision Transformer (ViT) models to demonstrate 16x compression while maintaining model accuracy.
Researcher Affiliation | Industry | Naigang Wang, Chi-Chun Liu, Swagath Venkataramani, Sanchari Sen, Chia-Yu Chen, Kaoutar El Maghraoui, Vijayalakshmi Srinivasan, Leland Chang; IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA; {nwang,cliu,swagath.venkataramani,sanchari.sen,cchen,kelmaghr,viji,lelandc}@us.ibm.com
Pseudocode | No | The paper describes methods and techniques but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Instructions and code examples are in Appendix
Open Datasets | Yes | We evaluate the proposed methods on three representative pre-trained models and corresponding downstream benchmarks in multiple application domains to demonstrate the effectiveness of our methods. Specifically, we investigate the BERT-base model on the SQuAD1.1 benchmark, the Wav2vec2.0 model on the Librispeech dataset and the ViT-base model on the ImageNet1k benchmark.
Dataset Splits | No | The paper mentions using standard datasets for fine-tuning (SQuAD1.1, Librispeech, ImageNet1k) but does not explicitly provide the training, validation, or test dataset split percentages or counts.
Hardware Specification | Yes | All experiments are performed on NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions using 'Huggingface Transformer [36]' and 'Timm packages [37]' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | Pre-trained BERT-base model on SQuAD1.1: for fine-tuning, we use a batch size of 12 and a sequence length of 384. We use the AdamW optimizer with a learning rate of 3e-5 with linear decay. The model is fine-tuned for 2-4 epochs with a dropout probability of 0.1-0.2. Wav2vec2.0 large model on Librispeech: for fine-tuning, we use the AdamW optimizer with a learning rate of 3e-4. The learning rate decays linearly after 500 warm-up steps. We use a batch size of 32 and tune the model for 6 epochs. ViT-base model on ImageNet1k: the optimizer is SGD with a learning rate of 0.01. We tune the model for 8 epochs using a cosine learning rate schedule, gradient clipping of 1.0, and a batch size of 512.
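
The 16x figure quoted in the Research Type row is consistent with simple arithmetic: 4-bit quantization of 32-bit weights gives 8x, and 50% sparsity halves the remaining storage. A minimal check, assuming an FP32 baseline and ignoring the small overhead of sparsity indices and quantization scales:

```python
# Back-of-the-envelope check of the quoted 16x compression figure,
# assuming 32-bit (FP32) baseline weights and ignoring the overhead of
# sparsity indices and quantization scale factors.
baseline_bits = 32
quantized_bits = 4      # 4-bit quantized weights -> 8x
density = 0.5           # 50% fine-grained structured sparsity -> 2x

compression_ratio = (baseline_bits / quantized_bits) / density
print(compression_ratio)  # 16.0
```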
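
For reference, the BERT-base/SQuAD1.1 settings quoted in the Experiment Setup row map onto the Hugging Face Transformers API roughly as sketched below. This is an illustrative sketch rather than the authors' released script: the output directory is hypothetical, the epoch and dropout values are picked from within the reported 2-4 and 0.1-0.2 ranges, and the 4-bit quantization and sparsity steps are not shown.

```python
# Illustrative fine-tuning configuration for BERT-base on SQuAD1.1, based on
# the hyperparameters quoted above (not the authors' released code).
from transformers import AutoConfig, AutoModelForQuestionAnswering, TrainingArguments

config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    hidden_dropout_prob=0.1,            # paper reports dropout of 0.1-0.2
    attention_probs_dropout_prob=0.1,
)
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased", config=config)

training_args = TrainingArguments(
    output_dir="bert-base-squad1.1",    # hypothetical output path
    per_device_train_batch_size=12,     # batch size of 12
    learning_rate=3e-5,                 # AdamW (Trainer's default optimizer)
    lr_scheduler_type="linear",         # linear learning-rate decay
    num_train_epochs=3,                 # paper fine-tunes for 2-4 epochs
)

# Inputs would be tokenized to a maximum sequence length of 384 and passed,
# together with `model` and `training_args`, to a `Trainer` for fine-tuning.
```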