I-BERT: Integer-only BERT Quantization
Authors: Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on GLUE downstream tasks using RoBERTa Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4–4.0× for INT8 inference on a T4 GPU system as compared to FP32 inference. |
| Researcher Affiliation | Academia | University of California, Berkeley. |
| Pseudocode | Yes | Algorithm 1: Integer-only Computation of Second-order Polynomial a(x + b)² + c; Algorithm 2: Integer-only GELU; Algorithm 3: Integer-only Exponential and Softmax; Algorithm 4: Integer-only Square Root. A sketch of the second-order polynomial kernel follows the table. |
| Open Source Code | Yes | The framework has been developed in PyTorch and has been open-sourced (Kim, 2021). |
| Open Datasets | Yes | We evaluate our approach on GLUE downstream tasks using RoBERTa Base/Large. |
| Dataset Splits | Yes | For each of the GLUE downstream tasks, we train both FP32 baseline and integer-only I-BERT models, and evaluate the accuracy on the development set. |
| Hardware Specification | Yes | Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4–4.0× for INT8 inference on a T4 GPU system as compared to FP32 inference. |
| Software Dependencies | No | The framework has been developed in PyTorch and has been open-sourced (Kim, 2021). Specific version numbers for PyTorch and TensorRT are not provided. |
| Experiment Setup | Yes | See Appendix C.2 and C.3 for training and evaluation details. |
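
The pseudocode row above cites Algorithm 1, which evaluates the second-order polynomial a(x + b)² + c using only integer arithmetic on the quantized input. The sketch below illustrates that idea, assuming the input is already quantized as x ≈ q · S with integer q and floating-point scale S; the function and variable names are illustrative and not taken from the authors' released code.

```python
import numpy as np

def int_poly(q, S, a, b, c):
    """Integer-only evaluation of a*(x + b)^2 + c, where x = q * S.

    Minimal sketch of the I-BERT second-order polynomial kernel:
    the floating-point constants b and c are folded into integer
    offsets so that the activation path uses only integer arithmetic.
    """
    q_b = np.floor(b / S)              # integer offset replacing b
    q_c = np.floor(c / (a * S ** 2))   # integer offset replacing c
    S_out = a * S ** 2                 # output scale, precomputed offline
    q_out = (q + q_b) ** 2 + q_c       # integer-only computation
    return q_out, S_out

# Example: x = 0.8 represented as q = 80 with scale S = 0.01
q_out, S_out = int_poly(np.array([80]), 0.01, a=0.5, b=1.0, c=0.25)
print(q_out * S_out)  # ~ 0.5 * (0.8 + 1.0)^2 + 0.25 = 1.87
```

The constants are folded into the integer offsets q_b and q_c and into the output scale S_out ahead of time, so inference itself touches only integers; this polynomial kernel is the building block the paper reuses for its integer-only GELU, Exponential/Softmax, and Square Root approximations.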