BiBERT: Accurate Fully Binarized BERT

Authors: Haotong Qin, Yifu Ding, Mingyuan Zhang, Qinghua Yan, Aishan Liu, Qingqing Dang, Ziwei Liu, Xianglong Liu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that BiBERT outperforms both the straightforward baseline and existing state-of-the-art quantized BERTs with ultra-low bit activations by convincing margins on the NLP benchmark. As the first fully binarized BERT, our method yields impressive 56.3× and 31.2× saving on FLOPs and model size, demonstrating the vast advantages and potential of the fully binarized BERT model in real-world resource-constrained scenarios. In this section, we conduct extensive experiments to validate the effectiveness of our proposed BiBERT for efficient learning on multiple architectures and the GLUE (Wang et al., 2018a) benchmark with diverse NLP tasks. (See the size-saving sketch after the table.)
Researcher Affiliation | Collaboration | Haotong Qin (1,4), Yifu Ding (1,4), Mingyuan Zhang (2), Qinghua Yan (1), Aishan Liu (1), Qingqing Dang (3), Ziwei Liu (2), Xianglong Liu (1). Affiliations: (1) State Key Lab of Software Development Environment, Beihang University; (2) S-Lab, Nanyang Technological University; (3) Baidu Inc.; (4) Shen Yuan Honors College, Beihang University.
Pseudocode | No | The paper describes its methods textually and visually (e.g., in Figure 7) but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is released at https://github.com/htqin/BiBERT.
Open Datasets | Yes | GLUE (Wang et al., 2018a) benchmark. (See the GLUE loading sketch after the table.)
Dataset Splits | Yes | Results show that BiBERT outperforms other methods on the development set of the GLUE benchmark, including TernaryBERT, BinaryBERT, Q-BERT, and Q2BERT.
Hardware Specification | No | The paper discusses computation (FLOPs) and model size savings but does not specify any particular hardware (GPU/CPU models, memory, etc.) used for the experiments.
Software Dependencies | No | The paper mentions using standard components such as the Adam optimizer, but does not specify software frameworks, libraries, or version numbers.
Experiment Setup | Yes | We use Adam as our optimizer, and adopt data augmentation on GLUE tasks except MNLI and QQP, where it brings little benefit but is time-consuming. It is noteworthy that we take more training epochs for every quantization method on each task to ensure sufficient training: 50 for CoLA, 20 for MRPC, STS-B, and RTE, 10 for SST-2 and QNLI, and 5 for MNLI and QQP. (A configuration sketch follows the table.)
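Below is a minimal sketch of the per-task training schedule quoted in the Experiment Setup row. It assumes PyTorch and its built-in Adam implementation; the learning rate is a placeholder, since the quoted text does not report it, and the data-augmentation pipeline itself is not sketched.

```python
# Sketch of the per-task schedule described in the Experiment Setup row.
# PyTorch and the placeholder learning rate are assumptions, not taken from the paper.
import torch

EPOCHS_PER_TASK = {
    "CoLA": 50,
    "MRPC": 20, "STS-B": 20, "RTE": 20,
    "SST-2": 10, "QNLI": 10,
    "MNLI": 5, "QQP": 5,
}

# Data augmentation is adopted on all GLUE tasks except MNLI and QQP.
USE_DATA_AUGMENTATION = {task: task not in {"MNLI", "QQP"} for task in EPOCHS_PER_TASK}

def make_optimizer(model: torch.nn.Module, lr: float = 2e-5) -> torch.optim.Adam:
    """Adam optimizer as stated in the setup; the learning rate is a placeholder."""
    return torch.optim.Adam(model.parameters(), lr=lr)
```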
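For reference, one common way to obtain the GLUE data mentioned in the Open Datasets row is through the HuggingFace datasets library; this is only an illustrative sketch and is not taken from the BiBERT repository, which ships its own data handling.

```python
# Illustrative GLUE loading via the HuggingFace `datasets` library
# (an assumption for illustration; not the authors' data pipeline).
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")   # other GLUE task names: "cola", "sst2", "stsb", ...
print(mrpc)                           # DatasetDict with train / validation / test splits
print(mrpc["validation"][0])          # one sentence-pair example from the development set
```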
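As a back-of-the-envelope illustration of the scale of the size saving quoted in the Research Type row, the sketch below compares 1-bit weight storage against 32-bit floats. It is not the paper's accounting (which yields 31.2× rather than the 32× ideal, e.g., because some parts of the model stay in higher precision), and the ~110M parameter count for BERT-base is an approximation.

```python
# Back-of-the-envelope size comparison: 32-bit floats vs. 1-bit binarized weights.
# Illustrative only; it does not reproduce the paper's exact 31.2x accounting.

FP32_BITS = 32
BINARY_BITS = 1
BERT_BASE_PARAMS = 110_000_000  # approximate parameter count of BERT-base

def size_mb(num_params: int, bits_per_param: int) -> float:
    """Storage in megabytes for num_params parameters at the given bit width."""
    return num_params * bits_per_param / 8 / 1e6

fp32_mb = size_mb(BERT_BASE_PARAMS, FP32_BITS)
bin_mb = size_mb(BERT_BASE_PARAMS, BINARY_BITS)
print(f"FP32: {fp32_mb:.0f} MB  binarized: {bin_mb:.0f} MB  ideal ratio: {fp32_mb / bin_mb:.0f}x")
```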