MA-BERT: Towards Matrix Arithmetic-only BERT Inference by Eliminating Complex Non-Linear Functions

Authors: Neo Wei Ming, Zhehui Wang, Cheng Liu, Rick Siow Mong Goh, Tao Luo

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that MA-BERT achieves up to 27% and 41% reduction in inference time on CPU and GPU, respectively, with comparable accuracy on many downstream tasks compared to the baseline BERT models.
Researcher Affiliation | Academia | Neo Wei Ming (1,2), Zhehui Wang (1), Cheng Liu (3), Rick Siow Mong Goh (1), Tao Luo (1). (1) Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), 1 Fusionopolis Way, 16-16 Connexis, Singapore 138632, Republic of Singapore; (2) School of Computer Science and Engineering, Nanyang Technological University, Singapore; (3) Institute of Computing Technology, Chinese Academy of Sciences.
Pseudocode | No | The paper describes methods and processes in text and mathematical formulas but does not include any explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | https://github.com/W6WM9M/MA-BERT
Open Datasets | Yes | We pretrained MA-BERT-base via knowledge transfer on a concatenation of BookCorpus (Zhu et al., 2015) and English Wikipedia using only the Masked Language Modelling objective (Devlin et al., 2018) with a masking probability of 15%. (See the masking sketch after this table.)
Dataset Splits | Yes | For smaller datasets (CoLA, MRPC, RTE, and STS-B), we used a batch size of 16 and fine-tuned for 10 epochs, and for the remaining datasets, we used a batch size of 32 and fine-tuned for 5 epochs... Table 2: Median performance on the development set of GLUE benchmark. (See the fine-tuning sketch after this table.)
Hardware Specification | Yes | on a single NVIDIA GeForce RTX 3090 for a total of 3.5 days.
Software Dependencies | No | We used C++ with Eigen3 library for CPU implementation and CUDA C++ with Thrust and cuBLAS libraries for GPU implementation. Specific version numbers for these libraries or frameworks are not provided.
Experiment Setup | Yes | We used the AdamW optimizer with a learning rate of 2e-5, 10,000 warm-up steps from a base learning rate of 5e-7, and a linear decay of learning rate. The training was done with a batch size of 256 over 3 epochs... Throughout our study, we kept t = 15, α = 0.9, β = 1, and γ = 100. (See the optimizer sketch after this table.)
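
The Open Datasets row quotes a Masked Language Modelling pretraining setup with a 15% masking probability. Below is a minimal sketch of that masking behaviour, assuming PyTorch and the Hugging Face `transformers` library; the tokenizer checkpoint is a placeholder, and none of this is taken from the MA-BERT repository.

```python
# Minimal sketch (assumed Hugging Face tooling, not the authors' code):
# randomly mask 15% of tokens for the Masked Language Modelling objective.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,              # masked language modelling objective, as quoted above
    mlm_probability=0.15,  # 15% masking probability
)
```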
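
The Dataset Splits row names two fine-tuning regimes. The hypothetical helper below simply makes that mapping explicit; the set of "remaining datasets" is assumed to be the rest of the GLUE suite and is not spelled out in the quote.

```python
# Hypothetical helper (not from the paper) encoding the quoted GLUE fine-tuning settings.
SMALL_GLUE_TASKS = {"cola", "mrpc", "rte", "stsb"}  # batch size 16, 10 epochs

def finetune_hyperparams(task: str) -> dict:
    """Return the batch size and epoch count quoted for a GLUE task."""
    if task.lower() in SMALL_GLUE_TASKS:
        return {"batch_size": 16, "epochs": 10}
    return {"batch_size": 32, "epochs": 5}  # remaining GLUE tasks
```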
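
The Experiment Setup row describes an AdamW optimizer warmed up from a base learning rate of 5e-7 to 2e-5 over 10,000 steps and then decayed linearly. A minimal PyTorch sketch of such a schedule follows; the model and total step count are placeholders, and the method-specific constants (t, α, β, γ) are omitted because their role is defined in the paper, not here.

```python
# Sketch of the quoted optimizer and learning-rate schedule (not the authors' implementation).
import torch
from torch.optim.lr_scheduler import LambdaLR

peak_lr, base_lr, warmup_steps = 2e-5, 5e-7, 10_000
num_training_steps = 100_000   # placeholder; in the paper this follows from batch size 256 over 3 epochs
model = torch.nn.Linear(8, 8)  # stand-in for the model being pretrained

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_factor(step: int) -> float:
    """Multiplicative factor on peak_lr: linear warm-up from base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return (base_lr + (peak_lr - base_lr) * step / warmup_steps) / peak_lr
    return max(0.0, (num_training_steps - step) / (num_training_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)  # call scheduler.step() after each optimizer step
```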