MA-BERT: Towards Matrix Arithmetic-only BERT Inference by Eliminating Complex Non-Linear Functions
Authors: Neo Wei Ming, Zhehui Wang, Cheng Liu, Rick Siow Mong Goh, Tao Luo
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that MA-BERT achieves up to 27% and 41% reduction in inference time on CPU and GPU, respectively, with comparable accuracy on many downstream tasks compared to the baseline BERT models. |
| Researcher Affiliation | Academia | Neo Wei Ming (1,2), Zhehui Wang (1), Cheng Liu (3), Rick Siow Mong Goh (1), Tao Luo (1). (1) Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), 1 Fusionopolis Way, 16-16 Connexis, Singapore 138632, Republic of Singapore; (2) School of Computer Science and Engineering, Nanyang Technological University, Singapore; (3) Institute of Computing Technology, Chinese Academy of Sciences |
| Pseudocode | No | The paper describes methods and processes in text and mathematical formulas but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | https://github.com/W6WM9M/MA-BERT |
| Open Datasets | Yes | We pretrained MA-BERTbase via knowledge transfer on a concatenation of BookCorpus (Zhu et al., 2015) and English Wikipedia using only the Masked Language Modelling objective (Devlin et al., 2018) with a masking probability of 15%. |
| Dataset Splits | Yes | For smaller datasets (CoLA, MRPC, RTE, and STS-B), we used a batch size of 16 and fine-tuned for 10 epochs, and for the remaining datasets, we used a batch size of 32 and fine-tuned for 5 epochs...Table 2: Median performance on the development set of GLUE benchmark. |
| Hardware Specification | Yes | on a single NVIDIA GeForce RTX 3090 for a total of 3.5 days. |
| Software Dependencies | No | We used C++ with the Eigen3 library for the CPU implementation and CUDA C++ with the Thrust and cuBLAS libraries for the GPU implementation. Specific version numbers for these libraries or frameworks are not provided. |
| Experiment Setup | Yes | We used the AdamW optimizer with a learning rate of 2e-5, 10,000 warm-up steps from a base learning rate of 5e-7, and a linear decay of learning rate. The training was done with a batch size of 256 over 3 epochs...Throughout our study, we kept t = 15, α = 0.9, β = 1, and γ = 100. |
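The "Experiment Setup" and "Dataset Splits" rows quote the optimizer schedule and fine-tuning settings reported in the paper. The sketch below shows one way to reproduce that schedule in PyTorch; it is not the authors' released code. The model placeholder, the total step count `TOTAL_STEPS`, and the `FINETUNE_CONFIG` dictionary are assumptions introduced here for illustration, while the peak learning rate, base learning rate, warm-up steps, batch sizes, and epoch counts come from the quoted excerpts.

```python
# Sketch of the pretraining optimizer schedule quoted above: AdamW at a peak
# learning rate of 2e-5, 10,000 warm-up steps from a base learning rate of 5e-7,
# then linear decay. Model and total step count are placeholders (assumptions).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

PEAK_LR = 2e-5         # peak learning rate (from the paper)
BASE_LR = 5e-7         # warm-up starting learning rate (from the paper)
WARMUP_STEPS = 10_000  # warm-up steps (from the paper)
TOTAL_STEPS = 100_000  # assumed: would follow from batch size 256 over 3 epochs

model = torch.nn.Linear(768, 768)  # placeholder for the MA-BERT model
optimizer = AdamW(model.parameters(), lr=PEAK_LR)

def lr_lambda(step: int) -> float:
    """Scale factor on PEAK_LR: linear warm-up from BASE_LR, then linear decay."""
    if step < WARMUP_STEPS:
        start = BASE_LR / PEAK_LR
        return start + (1.0 - start) * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return remaining / max(TOTAL_STEPS - WARMUP_STEPS, 1)

scheduler = LambdaLR(optimizer, lr_lambda)

# Fine-tuning settings quoted in the "Dataset Splits" row: batch size 16 for
# 10 epochs on the smaller GLUE tasks, batch size 32 for 5 epochs otherwise.
FINETUNE_CONFIG = {
    "small_tasks": {"datasets": ["CoLA", "MRPC", "RTE", "STS-B"],
                    "batch_size": 16, "epochs": 10},
    "large_tasks": {"batch_size": 32, "epochs": 5},
}
```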