Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Deep Compression of Pre-trained Transformer Models
Authors: Naigang Wang, Chi-Chun (Charlie) Liu, Swagath Venkataramani, Sanchari Sen, Chia-Yu Chen, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, Leland Chang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, we quantize transformer backbones down to 4-bit and further achieve 50% fine-grained structural sparsity on pre-trained BERT, Wav2vec2.0, and Vision Transformer (ViT) models to demonstrate 16x compression while maintaining model accuracy. |
| Researcher Affiliation | Industry | Naigang Wang, Chi-Chun Liu, Swagath Venkataramani, Sanchari Sen, Chia-Yu Chen, Kaoutar El Maghraoui, Vijayalakshmi Srinivasan, Leland Chang; IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA; EMAIL |
| Pseudocode | No | The paper describes methods and techniques but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Instructions and code examples are in Appendix |
| Open Datasets | Yes | We evaluate the proposed methods on three representative pre-trained models and corresponding downstream benchmarks in multiple application domains to demonstrate the effectiveness of our methods. Specifically, we investigate the BERT-base model on the SQuAD1.1 benchmark, the Wav2vec2.0 model on the Librispeech dataset and the ViT-base model on the ImageNet1k benchmark. |
| Dataset Splits | No | The paper mentions using standard datasets for fine-tuning (SQuAD1.1, Librispeech, ImageNet1k) but does not explicitly provide the training, validation, or test dataset split percentages or counts. |
| Hardware Specification | Yes | All experiments are performed on NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using 'Huggingface Transformer [36]' and 'Timm packages [37]' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Pre-trained BERT-base model on SQuAD1.1: For fine-tuning, we use a batch size of 12 and a sequence length of 384. We use the AdamW optimizer with a learning rate of 3e-5 with linear decay. The model is fine-tuned for 2-4 epochs with a dropout probability of 0.1-0.2. Wav2vec2.0 large model on Librispeech: For fine-tuning, we use the AdamW optimizer with a learning rate of 3e-4. The learning rate decays linearly after 500 warm-up steps. We use a batch size of 32 and tune the model for 6 epochs. ViT-base model on ImageNet1k: The optimizer is SGD with a learning rate of 0.01. We tune the model for 8 epochs using a cosine learning rate schedule, gradient clipping of 1.0, and a batch size of 512. |
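
The fine-tuning settings quoted in the Experiment Setup row can be collected into a small, machine-readable form. The sketch below is illustrative only and is not taken from the paper's code: the `FineTuneConfig` class, its field names, the `REPORTED_SETUPS` keys, and the `compression_ratio` helper are all hypothetical. The helper also assumes a 32-bit floating-point baseline and ignores any storage overhead for sparsity indices, which is one plausible reading of how 4-bit quantization combined with 50% sparsity yields the quoted 16x figure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""
    model: str
    dataset: str
    optimizer: str
    learning_rate: float
    lr_schedule: str
    batch_size: int
    epochs: Tuple[int, int]                      # (min, max) as reported
    warmup_steps: Optional[int] = None           # None where not reported
    max_seq_length: Optional[int] = None
    dropout: Optional[Tuple[float, float]] = None  # (min, max) as reported
    grad_clip: Optional[float] = None


# Values copied from the Experiment Setup row; unreported fields stay None.
REPORTED_SETUPS = {
    "bert": FineTuneConfig(
        model="BERT-base", dataset="SQuAD1.1", optimizer="AdamW",
        learning_rate=3e-5, lr_schedule="linear", batch_size=12,
        epochs=(2, 4), max_seq_length=384, dropout=(0.1, 0.2),
    ),
    "wav2vec2": FineTuneConfig(
        model="Wav2vec2.0 large", dataset="Librispeech", optimizer="AdamW",
        learning_rate=3e-4, lr_schedule="linear", batch_size=32,
        epochs=(6, 6), warmup_steps=500,
    ),
    "vit": FineTuneConfig(
        model="ViT-base", dataset="ImageNet1k", optimizer="SGD",
        learning_rate=0.01, lr_schedule="cosine", batch_size=512,
        epochs=(8, 8), grad_clip=1.0,
    ),
}


def compression_ratio(baseline_bits: int = 32, quant_bits: int = 4,
                      sparsity: float = 0.5) -> float:
    """Idealized weight compression ratio.

    Assumes a 32-bit baseline (an assumption, not stated in the table) and
    ignores metadata needed to store the sparsity pattern.
    """
    return (baseline_bits / quant_bits) / (1.0 - sparsity)


if __name__ == "__main__":
    for name, cfg in REPORTED_SETUPS.items():
        print(name, cfg)
    print("idealized compression:", compression_ratio())
```

Under these assumptions, quantization contributes a factor of 32/4 = 8 and 50% sparsity a further factor of 2. Fields left as `None` mark settings the quoted text does not report, so the gaps stay visible rather than being filled with guesses.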