BiBERT: Accurate Fully Binarized BERT

Authors: Haotong Qin, Yifu Ding, Mingyuan Zhang, Qinghua Yan, Aishan Liu, Qingqing Dang, Ziwei Liu, Xianglong Liu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that BiBERT outperforms both the straightforward baseline and existing state-of-the-art quantized BERTs with ultra-low bit activations by convincing margins on the NLP benchmark. As the first fully binarized BERT, our method yields impressive 56.3× and 31.2× saving on FLOPs and model size, demonstrating the vast advantages and potential of the fully binarized BERT model in real-world resource-constrained scenarios. In this section, we conduct extensive experiments to validate the effectiveness of our proposed BiBERT for efficient learning on multiple architectures and the GLUE (Wang et al., 2018a) benchmark with diverse NLP tasks. (See the size-saving sketch after the table.)
Researcher Affiliation | Collaboration | Haotong Qin (1,4), Yifu Ding (1,4), Mingyuan Zhang (2), Qinghua Yan (1), Aishan Liu (1), Qingqing Dang (3), Ziwei Liu (2), Xianglong Liu (1). Affiliations: (1) State Key Lab of Software Development Environment, Beihang University; (2) S-Lab, Nanyang Technological University; (3) Baidu Inc.; (4) Shen Yuan Honors College, Beihang University.
Pseudocode | No | The paper describes its methods textually and visually (e.g., in Figure 7) but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is released at https://github.com/htqin/BiBERT.
Open Datasets | Yes | GLUE (Wang et al., 2018a) benchmark. (See the GLUE loading sketch after the table.)
Dataset Splits | Yes | Results show that BiBERT outperforms other methods on the development set of the GLUE benchmark, including TernaryBERT, BinaryBERT, Q-BERT, and Q2BERT.
Hardware Specification | No | The paper discusses computation (FLOPs) and model size savings but does not specify any particular hardware (GPU/CPU models, memory, etc.) used for the experiments.
Software Dependencies | No | The paper mentions using standard components such as the Adam optimizer, but does not specify software frameworks, libraries, or version numbers.
Experiment Setup | Yes | We use Adam as our optimizer, and adopt data augmentation on GLUE tasks except MNLI and QQP, where it brings little benefit but is time-consuming. It is noteworthy that we take more training epochs for every quantization method on each task to ensure sufficient training: 50 for CoLA, 20 for MRPC, STS-B, and RTE, 10 for SST-2 and QNLI, and 5 for MNLI and QQP. (A configuration sketch follows the table.)
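Below is a minimal sketch of the per-task training schedule quoted in the Experiment Setup row. It assumes PyTorch and its built-in Adam implementation; the learning rate is a placeholder, since the quoted text does not report it, and the data-augmentation pipeline itself is not sketched.

```python
# Sketch of the per-task schedule described in the Experiment Setup row.
# PyTorch and the placeholder learning rate are assumptions, not taken from the paper.
import torch

EPOCHS_PER_TASK = {
    "CoLA": 50,
    "MRPC": 20, "STS-B": 20, "RTE": 20,
    "SST-2": 10, "QNLI": 10,
    "MNLI": 5, "QQP": 5,
}

# Data augmentation is adopted on all GLUE tasks except MNLI and QQP.
USE_DATA_AUGMENTATION = {task: task not in {"MNLI", "QQP"} for task in EPOCHS_PER_TASK}

def make_optimizer(model: torch.nn.Module, lr: float = 2e-5) -> torch.optim.Adam:
    """Adam optimizer as stated in the setup; the learning rate is a placeholder."""
    return torch.optim.Adam(model.parameters(), lr=lr)
```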
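For reference, one common way to obtain the GLUE data mentioned in the Open Datasets row is through the HuggingFace datasets library; this is only an illustrative sketch and is not taken from the BiBERT repository, which ships its own data handling.

```python
# Illustrative GLUE loading via the HuggingFace `datasets` library
# (an assumption for illustration; not the authors' data pipeline).
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")   # other GLUE task names: "cola", "sst2", "stsb", ...
print(mrpc)                           # DatasetDict with train / validation / test splits
print(mrpc["validation"][0])          # one sentence-pair example from the development set
```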
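As a back-of-the-envelope illustration of the scale of the size saving quoted in the Research Type row, the sketch below compares 1-bit weight storage against 32-bit floats. It is not the paper's accounting (which yields 31.2× rather than the 32× ideal, e.g., because some parts of the model stay in higher precision), and the ~110M parameter count for BERT-base is an approximation.

```python
# Back-of-the-envelope size comparison: 32-bit floats vs. 1-bit binarized weights.
# Illustrative only; it does not reproduce the paper's exact 31.2x accounting.

FP32_BITS = 32
BINARY_BITS = 1
BERT_BASE_PARAMS = 110_000_000  # approximate parameter count of BERT-base

def size_mb(num_params: int, bits_per_param: int) -> float:
    """Storage in megabytes for num_params parameters at the given bit width."""
    return num_params * bits_per_param / 8 / 1e6

fp32_mb = size_mb(BERT_BASE_PARAMS, FP32_BITS)
bin_mb = size_mb(BERT_BASE_PARAMS, BINARY_BITS)
print(f"FP32: {fp32_mb:.0f} MB  binarized: {bin_mb:.0f} MB  ideal ratio: {fp32_mb / bin_mb:.0f}x")
```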