Deep Compression of Pre-trained Transformer Models
Authors: Naigang Wang, Chi-Chun (Charlie) Liu, Swagath Venkataramani, Sanchari Sen, Chia-Yu Chen, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, Leland Chang
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, we quantize transformer backbones down to 4-bit and further achieve 50% fine-grained structured sparsity on pre-trained BERT, Wav2vec2.0, and Vision Transformer (ViT) models to demonstrate 16x compression while maintaining model accuracy. (A hedged quantization-and-sparsity sketch follows the table.) |
| Researcher Affiliation | Industry | Naigang Wang, Chi-Chun Liu, Swagath Venkataramani, Sanchari Sen, Chia-Yu Chen, Kaoutar El Maghraoui, Vijayalakshmi Srinivasan, Leland Chang; IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA; {nwang,cliu,swagath.venkataramani,sanchari.sen,cchen,kelmaghr,viji,lelandc}@us.ibm.com |
| Pseudocode | No | The paper describes methods and techniques but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Instructions and code examples are in the Appendix. |
| Open Datasets | Yes | We evaluate the proposed methods on three representative pre-trained models and corresponding downstream benchmarks in multiple application domains to demonstrate the effectiveness of our methods. Specifically, we investigate the BERT-base model on the SQuAD1.1 benchmark, the Wav2vec2.0 model on the Librispeech dataset and the ViT-base model on the ImageNet1k benchmark. |
| Dataset Splits | No | The paper mentions using standard datasets for fine-tuning (SQuAD1.1, Librispeech, ImageNet1k) but does not explicitly provide the training, validation, or test dataset split percentages or counts. |
| Hardware Specification | Yes | All experiments are performed on NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using 'Huggingface Transformer [36]' and 'Timm packages [37]' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Pre-trained BERT-base model on SQuAD1.1: For fine-tuning, we use a batch size of 12 and a sequence length of 384. We use the AdamW optimizer with a learning rate of 3e-5 and linear decay. The model is fine-tuned for 2-4 epochs with a dropout probability of 0.1-0.2. Wav2vec2.0 large model on Librispeech: For fine-tuning, we use the AdamW optimizer with a learning rate of 3e-4. The learning rate decays linearly after 500 warm-up steps. We use a batch size of 32 and tune the model for 6 epochs. ViT-base model on ImageNet1k: The optimizer is SGD with a learning rate of 0.01. We tune the model for 8 epochs using a cosine learning rate schedule, gradient clipping of 1.0, and batch size of 512. (A hedged fine-tuning configuration sketch follows the table.) |
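The paper itself does not provide a listing for its compression recipe, but the combination quoted in the Research Type row (4-bit weights plus 50% fine-grained structured sparsity) can be illustrated with a minimal PyTorch sketch. The symmetric per-tensor quantizer, the 2:4 magnitude-based pruning pattern, and the function names below are assumptions made for illustration, not the authors' implementation.

```python
import torch

def quantize_4bit_symmetric(w: torch.Tensor) -> torch.Tensor:
    """Fake-quantize a weight tensor to 4-bit signed values (symmetric, per-tensor).

    Illustrative sketch only: the scale comes from the per-tensor max-abs value,
    and weights are rounded onto the 15-level grid [-7, 7].
    """
    scale = w.abs().max() / 7.0
    q = torch.clamp(torch.round(w / scale), -7, 7)
    return q * scale  # de-quantize back to float to simulate 4-bit weights

def prune_2_to_4(w: torch.Tensor) -> torch.Tensor:
    """Apply 50% fine-grained (2:4) structured sparsity by magnitude.

    Every group of 4 consecutive weights along the flattened tensor keeps its
    2 largest-magnitude entries and zeroes the other 2.
    """
    orig_shape = w.shape
    groups = w.reshape(-1, 4)
    # Indices of the 2 smallest-magnitude entries in each group of 4.
    _, drop_idx = groups.abs().topk(2, dim=1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop_idx, 0.0)
    return (groups * mask).reshape(orig_shape)

# Example: compress one linear layer's weights (quantize, then prune).
w = torch.randn(768, 768)
w_compressed = prune_2_to_4(quantize_4bit_symmetric(w))
print("sparsity:", (w_compressed == 0).float().mean().item())  # at least 0.5
```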
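Likewise, the BERT-base/SQuAD1.1 fine-tuning hyperparameters quoted in the Experiment Setup row can be translated into a configuration sketch. This assumes the HuggingFace Transformers API; the checkpoint name, the zero warm-up steps, the choice of 3 epochs from the reported 2-4 range, and the approximate step count are illustrative assumptions rather than details from the paper.

```python
from torch.optim import AdamW
from transformers import AutoModelForQuestionAnswering, get_linear_schedule_with_warmup

# Hyperparameters quoted in the paper for BERT-base on SQuAD1.1.
BATCH_SIZE = 12
MAX_SEQ_LEN = 384        # applied when tokenizing SQuAD examples (tokenization omitted here)
LEARNING_RATE = 3e-5     # AdamW with linear decay
NUM_EPOCHS = 3           # paper reports 2-4 epochs
DROPOUT = 0.1            # paper reports 0.1-0.2

# Assumed checkpoint name; the paper only says "pre-trained BERT-base".
model = AutoModelForQuestionAnswering.from_pretrained(
    "bert-base-uncased",
    hidden_dropout_prob=DROPOUT,
    attention_probs_dropout_prob=DROPOUT,
)

# Approximate step count: SQuAD1.1 has roughly 88k training examples.
TOTAL_STEPS = NUM_EPOCHS * (88_000 // BATCH_SIZE)

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,              # no warm-up is stated for the BERT setup
    num_training_steps=TOTAL_STEPS,  # learning rate decays linearly to zero
)
```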