PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination
Authors: Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, Ashish Verma
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present an experimental evaluation on a wide spectrum of classification/regression tasks from the popular GLUE benchmark. The results show that PoWER-BERT achieves up to 4.5x reduction in inference time over BERT-BASE with < 1% loss in accuracy. |
| Researcher Affiliation | Industry | ¹IBM Research, New Delhi, India; ²IBM Research, Yorktown, New York, USA. |
| Pseudocode | No | The paper describes the PoWER-BERT scheme and its components textually and with figures, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for PoWER-BERT is publicly available at https://github.com/IBM/PoWER-BERT. |
| Open Datasets | Yes | We evaluate our approach on a wide spectrum of classification/regression tasks pertaining to 9 datasets from the GLUE benchmark (Wang et al., 2019a), and the IMDB (Maas et al., 2011) and the RACE (Lai et al., 2017) datasets. |
| Dataset Splits | Yes | The hyper-parameters for both PoWER-BERT and the baseline methods were tuned on the Dev dataset for GLUE and RACE tasks. For IMDB, we subdivided the training data into 80% for training and 20% for tuning. |
| Hardware Specification | Yes | The inference time experiments for Po WER-BERT and the baselines were conducted using Keras framework on a K80 GPU machine. |
| Software Dependencies | No | The paper mentions that the code was 'implemented in Keras' but does not specify version numbers for Keras or any other software dependencies. |
| Experiment Setup | Yes | Training PoWER-BERT primarily involves four hyper-parameters, which we select from the ranges listed below: a) learning rate for the newly introduced soft-extract layers [10⁻⁴, 10⁻²]; b) learning rate for the parameters from the original BERT model [2×10⁻⁵, 6×10⁻⁵]; c) regularization parameter λ that controls the trade-off between accuracy and inference time [10⁻⁴, 10⁻³]; d) batch size {4, 8, 16, 32, 64}. |
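
The hyper-parameter ranges quoted in the Experiment Setup row can be read as a small search space. Below is a minimal, hypothetical Python sketch of that space; the dictionary keys, the `sample_configuration` helper, and the log-uniform sampling of the continuous ranges are illustrative assumptions, not part of the released PoWER-BERT code.

```python
import math
import random

# Hyper-parameter ranges reported in the paper (Experiment Setup row above).
# Names are illustrative; only the numeric ranges come from the paper.
SEARCH_SPACE = {
    "soft_extract_lr": (1e-4, 1e-2),   # learning rate for the soft-extract layers
    "bert_lr": (2e-5, 6e-5),           # learning rate for the original BERT parameters
    "lambda_reg": (1e-4, 1e-3),        # accuracy vs. inference-time trade-off
    "batch_size": [4, 8, 16, 32, 64],  # discrete choices
}


def sample_configuration(rng: random.Random) -> dict:
    """Draw one configuration; continuous ranges are sampled log-uniformly (an assumption)."""
    def log_uniform(lo: float, hi: float) -> float:
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))

    return {
        "soft_extract_lr": log_uniform(*SEARCH_SPACE["soft_extract_lr"]),
        "bert_lr": log_uniform(*SEARCH_SPACE["bert_lr"]),
        "lambda_reg": log_uniform(*SEARCH_SPACE["lambda_reg"]),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
    }


if __name__ == "__main__":
    # Example: draw one candidate configuration for tuning on the Dev set.
    print(sample_configuration(random.Random(0)))
```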