Deep Compression of Pre-trained Transformer Models
Authors: Naigang Wang, Chi-Chun (Charlie) Liu, Swagath Venkataramani, Sanchari Sen, Chia-Yu Chen, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, Leland Chang
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, we quantize transformer backbones down to 4-bit and further achieve 50% fine-grained structured sparsity on pre-trained BERT, Wav2vec2.0, and Vision Transformer (ViT) models to demonstrate 16x compression while maintaining model accuracy. (A hedged quantization-and-sparsity sketch follows the table.) |
| Researcher Affiliation | Industry | Naigang Wang, Chi-Chun Liu, Swagath Venkataramani, Sanchari Sen, Chia-Yu Chen, Kaoutar El Maghraoui, Vijayalakshmi Srinivasan, Leland Chang; IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA; {nwang,cliu,swagath.venkataramani,sanchari.sen,cchen,kelmaghr,viji,lelandc}@us.ibm.com |
| Pseudocode | No | The paper describes methods and techniques but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Instructions and code examples are in the Appendix. |
| Open Datasets | Yes | We evaluate the proposed methods on three representative pre-trained models and corresponding downstream benchmarks in multiple application domains to demonstrate the effectiveness of our methods. Specifically, we investigate the BERT-base model on the SQuAD1.1 benchmark, the Wav2vec2.0 model on the Librispeech dataset and the ViT-base model on the ImageNet1k benchmark. |
| Dataset Splits | No | The paper mentions using standard datasets for fine-tuning (SQuAD1.1, Librispeech, ImageNet1k) but does not explicitly provide the training, validation, or test dataset split percentages or counts. |
| Hardware Specification | Yes | All experiments are performed on NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using 'Huggingface Transformer [36]' and 'Timm packages [37]' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Pre-trained BERT-base model on SQuAD1.1: For fine-tuning, we use a batch size of 12 and a sequence length of 384. We use the AdamW optimizer with a learning rate of 3e-5 and linear decay. The model is fine-tuned for 2-4 epochs with a dropout probability of 0.1-0.2. Wav2vec2.0 large model on Librispeech: For fine-tuning, we use the AdamW optimizer with a learning rate of 3e-4. The learning rate decays linearly after 500 warm-up steps. We use a batch size of 32 and tune the model for 6 epochs. ViT-base model on ImageNet1k: The optimizer is SGD with a learning rate of 0.01. We tune the model for 8 epochs using a cosine learning rate schedule, gradient clipping of 1.0, and batch size of 512. (A hedged fine-tuning configuration sketch follows the table.) |
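The paper itself does not provide a listing for its compression recipe, but the combination quoted in the Research Type row (4-bit weights plus 50% fine-grained structured sparsity) can be illustrated with a minimal PyTorch sketch. The symmetric per-tensor quantizer, the 2:4 magnitude-based pruning pattern, and the function names below are assumptions made for illustration, not the authors' implementation.

```python
import torch

def quantize_4bit_symmetric(w: torch.Tensor) -> torch.Tensor:
    """Fake-quantize a weight tensor to 4-bit signed values (symmetric, per-tensor).

    Illustrative sketch only: the scale comes from the per-tensor max-abs value,
    and weights are rounded onto the 15-level grid [-7, 7].
    """
    scale = w.abs().max() / 7.0
    q = torch.clamp(torch.round(w / scale), -7, 7)
    return q * scale  # de-quantize back to float to simulate 4-bit weights

def prune_2_to_4(w: torch.Tensor) -> torch.Tensor:
    """Apply 50% fine-grained (2:4) structured sparsity by magnitude.

    Every group of 4 consecutive weights along the flattened tensor keeps its
    2 largest-magnitude entries and zeroes the other 2.
    """
    orig_shape = w.shape
    groups = w.reshape(-1, 4)
    # Indices of the 2 smallest-magnitude entries in each group of 4.
    _, drop_idx = groups.abs().topk(2, dim=1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop_idx, 0.0)
    return (groups * mask).reshape(orig_shape)

# Example: compress one linear layer's weights (quantize, then prune).
w = torch.randn(768, 768)
w_compressed = prune_2_to_4(quantize_4bit_symmetric(w))
print("sparsity:", (w_compressed == 0).float().mean().item())  # at least 0.5
```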
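Likewise, the BERT-base/SQuAD1.1 fine-tuning hyperparameters quoted in the Experiment Setup row can be translated into a configuration sketch. This assumes the HuggingFace Transformers API; the checkpoint name, the zero warm-up steps, the choice of 3 epochs from the reported 2-4 range, and the approximate step count are illustrative assumptions rather than details from the paper.

```python
from torch.optim import AdamW
from transformers import AutoModelForQuestionAnswering, get_linear_schedule_with_warmup

# Hyperparameters quoted in the paper for BERT-base on SQuAD1.1.
BATCH_SIZE = 12
MAX_SEQ_LEN = 384        # applied when tokenizing SQuAD examples (tokenization omitted here)
LEARNING_RATE = 3e-5     # AdamW with linear decay
NUM_EPOCHS = 3           # paper reports 2-4 epochs
DROPOUT = 0.1            # paper reports 0.1-0.2

# Assumed checkpoint name; the paper only says "pre-trained BERT-base".
model = AutoModelForQuestionAnswering.from_pretrained(
    "bert-base-uncased",
    hidden_dropout_prob=DROPOUT,
    attention_probs_dropout_prob=DROPOUT,
)

# Approximate step count: SQuAD1.1 has roughly 88k training examples.
TOTAL_STEPS = NUM_EPOCHS * (88_000 // BATCH_SIZE)

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,              # no warm-up is stated for the BERT setup
    num_training_steps=TOTAL_STEPS,  # learning rate decays linearly to zero
)
```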