PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination

Authors: Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, Ashish Verma

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present an experimental evaluation on a wide spectrum of classification/regression tasks from the popular GLUE benchmark. The results show that PoWER-BERT achieves up to 4.5x reduction in inference time over BERT-Base with < 1% loss in accuracy.
Researcher Affiliation | Industry | (1) IBM Research, New Delhi, India; (2) IBM Research, Yorktown, New York, USA.
Pseudocode | No | The paper describes the PoWER-BERT scheme and its components textually and with figures, but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The code for PoWER-BERT is publicly available at https://github.com/IBM/PoWER-BERT.
Open Datasets | Yes | We evaluate our approach on a wide spectrum of classification/regression tasks pertaining to 9 datasets from the GLUE benchmark (Wang et al., 2019a), and the IMDB (Maas et al., 2011) and the RACE (Lai et al., 2017) datasets.
Dataset Splits | Yes | The hyper-parameters for both PoWER-BERT and the baseline methods were tuned on the Dev dataset for GLUE and RACE tasks. For IMDB, we subdivided the training data into 80% for training and 20% for tuning.
Hardware Specification | Yes | The inference time experiments for PoWER-BERT and the baselines were conducted using Keras framework on a K80 GPU machine.
Software Dependencies | No | The paper mentions that the code was 'implemented in Keras' but does not specify version numbers for Keras or any other software dependencies.
Experiment Setup | Yes | Training PoWER-BERT primarily involves four hyper-parameters, which we select from the ranges listed below: a) learning rate for the newly introduced soft-extract layers [10^-4, 10^-2]; b) learning rate for the parameters from the original BERT model [2x10^-5, 6x10^-5]; c) regularization parameter λ that controls the trade-off between accuracy and inference time [10^-4, 10^-3]; d) batch size {4, 8, 16, 32, 64}.
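
The datasets quoted in the "Open Datasets" row are all publicly downloadable. As a minimal sketch, they can be fetched with the Hugging Face `datasets` library; this tooling is an assumption for illustration, since the paper does not specify a download mechanism.

```python
# Illustrative retrieval of the public datasets named above. The use of
# the Hugging Face `datasets` library is an assumption for convenience;
# the paper itself does not state how the data was obtained.
from datasets import load_dataset

glue_sst2 = load_dataset("glue", "sst2")  # SST-2, one of the 9 GLUE tasks
imdb = load_dataset("imdb")               # IMDB (Maas et al., 2011)
race = load_dataset("race", "all")        # RACE (Lai et al., 2017)
```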
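The 80%/20% IMDB subdivision quoted in the "Dataset Splits" row could be reproduced along the lines below; the seed, label stratification, and tooling are assumptions, since the paper does not state them.

```python
# Minimal sketch of the 80%/20% IMDB train/tune split described above.
from tensorflow.keras.datasets import imdb
from sklearn.model_selection import train_test_split

(x_train, y_train), _ = imdb.load_data(num_words=20000)  # vocab cap assumed

x_tr, x_tune, y_tr, y_tune = train_test_split(
    x_train, y_train,
    test_size=0.20,    # 20% of the training data held out for tuning
    stratify=y_train,  # preserve label balance (assumed, not stated)
    random_state=0,    # arbitrary seed; the paper gives none
)
```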
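The four hyper-parameter ranges in the "Experiment Setup" row translate directly into a search space. The log-uniform random sampler below is a sketch: the paper says only that values were tuned on the Dev sets, not how candidate configurations were drawn.

```python
# Hypothetical encoding of the four tuning ranges quoted above.
import math
import random

SEARCH_SPACE = {
    "lr_soft_extract": (1e-4, 1e-2),  # soft-extract layer learning rate
    "lr_bert": (2e-5, 6e-5),          # learning rate for original BERT weights
    "lambda_reg": (1e-4, 1e-3),       # accuracy/inference-time trade-off
}
BATCH_SIZES = [4, 8, 16, 32, 64]

def sample_config(rng=random):
    """Draw one candidate configuration, log-uniform over each range."""
    cfg = {name: math.exp(rng.uniform(math.log(lo), math.log(hi)))
           for name, (lo, hi) in SEARCH_SPACE.items()}
    cfg["batch_size"] = rng.choice(BATCH_SIZES)
    return cfg

print(sample_config())  # one random draw from the quoted ranges
```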