PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination

Authors: Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, Ashish Verma

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present an experimental evaluation on a wide spectrum of classification/regression tasks from the popular GLUE benchmark. The results show that PoWER-BERT achieves up to 4.5x reduction in inference time over BERT-Base with < 1% loss in accuracy.
Researcher Affiliation | Industry | (1) IBM Research, New Delhi, India; (2) IBM Research, Yorktown, New York, USA.
Pseudocode | No | The paper describes the PoWER-BERT scheme and its components textually and with figures, but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The code for PoWER-BERT is publicly available at https://github.com/IBM/PoWER-BERT.
Open Datasets | Yes | We evaluate our approach on a wide spectrum of classification/regression tasks pertaining to 9 datasets from the GLUE benchmark (Wang et al., 2019a), and the IMDB (Maas et al., 2011) and the RACE (Lai et al., 2017) datasets.
Dataset Splits | Yes | The hyper-parameters for both PoWER-BERT and the baseline methods were tuned on the Dev dataset for GLUE and RACE tasks. For IMDB, we subdivided the training data into 80% for training and 20% for tuning.
Hardware Specification | Yes | The inference time experiments for PoWER-BERT and the baselines were conducted using Keras framework on a K80 GPU machine.
Software Dependencies | No | The paper mentions that the code was 'implemented in Keras' but does not specify version numbers for Keras or any other software dependencies.
Experiment Setup | Yes | Training PoWER-BERT primarily involves four hyper-parameters, which we select from the ranges listed below: a) learning rate for the newly introduced soft-extract layers [10^-4, 10^-2]; b) learning rate for the parameters from the original BERT model [2x10^-5, 6x10^-5]; c) regularization parameter λ that controls the trade-off between accuracy and inference time [10^-4, 10^-3]; d) batch size {4, 8, 16, 32, 64}.
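
The datasets quoted in the "Open Datasets" row are all publicly downloadable. As a minimal sketch, they can be fetched with the Hugging Face `datasets` library; this tooling is an assumption for illustration, since the paper does not specify a download mechanism.

```python
# Illustrative retrieval of the public datasets named above. The use of
# the Hugging Face `datasets` library is an assumption for convenience;
# the paper itself does not state how the data was obtained.
from datasets import load_dataset

glue_sst2 = load_dataset("glue", "sst2")  # SST-2, one of the 9 GLUE tasks
imdb = load_dataset("imdb")               # IMDB (Maas et al., 2011)
race = load_dataset("race", "all")        # RACE (Lai et al., 2017)
```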
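The 80%/20% IMDB subdivision quoted in the "Dataset Splits" row could be reproduced along the lines below; the seed, label stratification, and tooling are assumptions, since the paper does not state them.

```python
# Minimal sketch of the 80%/20% IMDB train/tune split described above.
from tensorflow.keras.datasets import imdb
from sklearn.model_selection import train_test_split

(x_train, y_train), _ = imdb.load_data(num_words=20000)  # vocab cap assumed

x_tr, x_tune, y_tr, y_tune = train_test_split(
    x_train, y_train,
    test_size=0.20,    # 20% of the training data held out for tuning
    stratify=y_train,  # preserve label balance (assumed, not stated)
    random_state=0,    # arbitrary seed; the paper gives none
)
```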
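The four hyper-parameter ranges in the "Experiment Setup" row translate directly into a search space. The log-uniform random sampler below is a sketch: the paper says only that values were tuned on the Dev sets, not how candidate configurations were drawn.

```python
# Hypothetical encoding of the four tuning ranges quoted above.
import math
import random

SEARCH_SPACE = {
    "lr_soft_extract": (1e-4, 1e-2),  # soft-extract layer learning rate
    "lr_bert": (2e-5, 6e-5),          # learning rate for original BERT weights
    "lambda_reg": (1e-4, 1e-3),       # accuracy/inference-time trade-off
}
BATCH_SIZES = [4, 8, 16, 32, 64]

def sample_config(rng=random):
    """Draw one candidate configuration, log-uniform over each range."""
    cfg = {name: math.exp(rng.uniform(math.log(lo), math.log(hi)))
           for name, (lo, hi) in SEARCH_SPACE.items()}
    cfg["batch_size"] = rng.choice(BATCH_SIZES)
    return cfg

print(sample_config())  # one random draw from the quoted ranges
```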