BiBERT: Accurate Fully Binarized BERT
Authors: Haotong Qin, Yifu Ding, Mingyuan Zhang, Qinghua Yan, Aishan Liu, Qingqing Dang, Ziwei Liu, Xianglong Liu
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that BiBERT outperforms both the straightforward baseline and existing state-of-the-art quantized BERTs with ultra-low bit activations by convincing margins on the NLP benchmark. As the first fully binarized BERT, our method yields impressive 56.3× and 31.2× saving on FLOPs and model size, demonstrating the vast advantages and potential of the fully binarized BERT model in real-world resource-constrained scenarios. In this section, we conduct extensive experiments to validate the effectiveness of our proposed BiBERT for efficient learning on multiple architectures and the GLUE (Wang et al., 2018a) benchmark with diverse NLP tasks. |
| Researcher Affiliation | Collaboration | Haotong Qin 1,4, Yifu Ding 1,4, Mingyuan Zhang 2, Qinghua Yan 1, Aishan Liu 1, Qingqing Dang 3, Ziwei Liu 2, Xianglong Liu 1; 1 State Key Lab of Software Development Environment, Beihang University; 2 S-Lab, Nanyang Technological University; 3 Baidu Inc.; 4 Shen Yuan Honors College, Beihang University |
| Pseudocode | No | The paper describes methods textually and visually in Figure 7, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is released at https://github.com/htqin/BiBERT. |
| Open Datasets | Yes | GLUE (Wang et al., 2018a) benchmark |
| Dataset Splits | Yes | Results show that BiBERT outperforms other methods on the development set of the GLUE benchmark, including TernaryBERT, BinaryBERT, Q-BERT, and Q2BERT. (A dev-set loading sketch follows the table.) |
| Hardware Specification | No | The paper discusses computation (FLOPs) and model size savings but does not specify any particular hardware (GPU/CPU models, memory, etc.) used for the experiments. |
| Software Dependencies | No | The paper mentions using the Adam optimizer (see Experiment Setup) but does not specify software dependencies or library versions. |
| Experiment Setup | Yes | We use Adam as our optimizer and adopt data augmentation on GLUE tasks except MNLI and QQP, where it brings little benefit but is time-consuming. It is noteworthy that we use more training epochs for every quantization method on each task to ensure sufficient training: 50 for CoLA, 20 for MRPC, STS-B, and RTE, 10 for SST-2 and QNLI, and 5 for MNLI and QQP. (A configuration sketch follows the table.) |
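
The Dataset Splits row reports evaluation on the development sets of the GLUE benchmark. The snippet below is a minimal sketch of how those dev splits might be loaded; the use of the Hugging Face `datasets` library and its GLUE config names are assumptions for illustration, not something the paper specifies.

```python
# Sketch: load the GLUE development sets, assuming the Hugging Face `datasets` library.
from datasets import load_dataset

# GLUE tasks covered in the paper's experiments (config names follow `datasets`).
GLUE_TASKS = ["cola", "mrpc", "stsb", "rte", "sst2", "qnli", "mnli", "qqp"]

dev_sets = {}
for task in GLUE_TASKS:
    ds = load_dataset("glue", task)
    # MNLI has matched/mismatched dev splits; the other tasks use "validation".
    split = "validation_matched" if task == "mnli" else "validation"
    dev_sets[task] = ds[split]
    print(task, len(dev_sets[task]))
```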
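The Experiment Setup row lists Adam, selective data augmentation, and per-task epoch counts. The sketch below encodes that schedule as a plain configuration and a generic fine-tuning loop; the learning rate, the `finetune` helper, and the Hugging Face-style model interface (a forward pass returning an object with `.loss`) are assumptions for illustration only.

```python
# Sketch of the fine-tuning schedule described in the Experiment Setup row.
import torch

# Epochs per GLUE task as reported in the paper.
EPOCHS = {
    "cola": 50,
    "mrpc": 20, "stsb": 20, "rte": 20,
    "sst2": 10, "qnli": 10,
    "mnli": 5, "qqp": 5,
}

# Data augmentation is applied to all tasks except MNLI and QQP.
AUGMENT = {task: task not in {"mnli", "qqp"} for task in EPOCHS}

def finetune(model, train_loader, task, lr=2e-5):
    """Fine-tune a (binarized) BERT on one GLUE task with Adam.

    `lr` and the model interface are placeholders; the paper only states
    that Adam is used and how many epochs each task is trained for.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(EPOCHS[task]):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(**batch).loss  # assumes an output object exposing .loss
            loss.backward()
            optimizer.step()
    return model
```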