OneBit: Towards Extremely Low-bit Large Language Models

Authors: Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Sufficient experimental results indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes when only using 1-bit weight matrices. (A sketch of such a 1-bit layer follows the table.)
Researcher Affiliation | Academia | Yuzhuang Xu¹, Xu Han², Zonghan Yang², Shuo Wang², Qingfu Zhu¹, Zhiyuan Liu², Weidong Liu², Wanxiang Che¹ (corresponding author). ¹Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, Harbin, China. ²Department of Computer Science & Technology, Tsinghua University, Beijing, China.
Pseudocode | No | No pseudocode or algorithm blocks were found.
Open Source Code | Yes | Code and checkpoints are available at https://github.com/xuyuzhuang11/OneBit
Open Datasets | Yes | We evaluate our approach by performing experiments on OPT-1.3B/2.7B models, LLaMA-7B/13B models and LLaMA2-7B/13B models, and present results on various tasks. ... Specifically, on WikiText2 [23] and C4 [28].
Dataset Splits | Yes | Basically, we evaluate quantized models by testing the perplexity on the validation set, specifically on WikiText2 [23] and C4 [28].
Hardware Specification | Yes | We use NVIDIA A100 GPUs and maintain FP16 precision while training quantized models.
Software Dependencies | No | We employ NMF in scikit-learn to decompose the weight matrices in SVID. (A sketch of this decomposition follows the table.)
Experiment Setup | Yes | Every KD experiment learns the training data over 50 epochs, from which 2048-token segments are selected. We employ NMF in scikit-learn to decompose the weight matrices in SVID. The quantized student models are optimized by Adam [19] with β₁ = 0.9, β₂ = 0.98. The learning rate for all experiments is scheduled by a cosine strategy. We use NVIDIA A100 GPUs and maintain FP16 precision while training quantized models. For additional details such as learning rate, please refer to Table 1. (A PyTorch sketch of this optimizer setup follows the table.)
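To make the "1-bit weight matrices" claim concrete, below is a minimal PyTorch sketch of a OneBit-style linear layer, assuming (per the paper's description) that each weight matrix is replaced by a fixed ±1 sign matrix plus two trainable value vectors that rescale input and output features. The class name, initialization, and forward-pass details are illustrative assumptions, not the authors' implementation; the released repository contains the real code.

```python
import torch
import torch.nn as nn

class OneBitLinearSketch(nn.Module):
    """Illustrative 1-bit linear layer (not the authors' code).

    Weights are a fixed ±1 sign matrix; expressiveness is recovered by
    two trainable value vectors, g (over input features) and h (over
    output features), which the paper keeps in FP16.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # ±1 matrix; a real implementation would bit-pack this buffer.
        w = torch.randn(out_features, in_features)
        self.register_buffer("w_sign", torch.where(w >= 0, 1.0, -1.0))
        self.g = nn.Parameter(torch.ones(in_features))
        self.h = nn.Parameter(torch.ones(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale inputs by g, multiply by the sign matrix, scale outputs by h.
        return ((x * self.g) @ self.w_sign.t()) * self.h

# Usage: shapes only; weights here are random, not distilled.
layer = OneBitLinearSketch(128, 64)
y = layer(torch.randn(4, 128))
print(y.shape)  # torch.Size([4, 64])
```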
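The Software Dependencies and Experiment Setup rows quote the paper's use of NMF from scikit-learn to decompose weight matrices in SVID. Below is a hedged sketch of that step, assuming SVID factors a weight matrix W into its sign matrix times a rank-1 nonnegative approximation of |W|; the function name `svid` and the NMF settings (`init`, `max_iter`) are our own choices, not values reported in the paper.

```python
import numpy as np
from sklearn.decomposition import NMF

def svid(W: np.ndarray):
    """Sign-Value-Independent Decomposition (sketch).

    Splits W into a ±1 sign matrix and a rank-1 approximation of |W|
    from single-component NMF, so that W ≈ sign(W) * outer(a, b).
    """
    W_sign = np.where(W >= 0, 1.0, -1.0)  # ±1 sign matrix
    nmf = NMF(n_components=1, init="nndsvd", max_iter=500)
    A = nmf.fit_transform(np.abs(W))      # shape (m, 1), nonnegative input
    B = nmf.components_                   # shape (1, n)
    return W_sign, A[:, 0], B[0, :]

# Usage: check the relative reconstruction error on a random matrix.
W = np.random.randn(64, 128).astype(np.float32)
W_sign, a, b = svid(W)
W_hat = W_sign * np.outer(a, b)
print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```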
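Finally, the quoted optimization setup (Adam with β₁ = 0.9, β₂ = 0.98 and a cosine learning-rate schedule) maps onto standard PyTorch components. The sketch below shows one plausible wiring; the model, learning rate, and step count are placeholders, since the paper defers those details to its Table 1.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder stand-in for the 1-bit quantized student model.
model = torch.nn.Linear(1024, 1024)

# Betas match the quoted setup; lr is a placeholder (per-model values
# appear in the paper's Table 1).
optimizer = Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.98))

total_steps = 100  # placeholder; the paper trains for 50 epochs of KD data
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    optimizer.zero_grad()
    x = torch.randn(8, 1024)
    loss = model(x).pow(2).mean()  # dummy loss standing in for the KD objective
    loss.backward()
    optimizer.step()
    scheduler.step()
```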