PB-LLM: Partially Binarized Large Language Models

Authors: Zhihang Yuan, Yuzhang Shang, Zhen Dong

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we explore network binarization specifically for LLM quantization and propose Partially-Binarized LLMs (abbreviated as PB-LLM). Specifically, following the binarization benchmark in BiBench [Qin et al., 2023], we generalize some representative binarization methods into LLM quantization scenarios. The results evaluated on seven zero-shot common sense reasoning tasks are shown in Fig. 2. (An illustrative partial-binarization sketch follows the table.)
Researcher Affiliation | Collaboration | Zhihang Yuan (Houmo AI); Zhen Dong (UC Berkeley)
Pseudocode | No | The paper describes mathematical formulations and processes but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | The code is available at PB-LLM. Codes can be found anonymously in PB-LLM.
Open Datasets | Yes | In this study, the PB-LLM is trained using the RedPajama-simple-1B dataset, as the dataset for LLaMA training is not openly accessible. This dataset, RedPajama-1T, is structured to closely resemble the LLaMA paper and serves as a transparent, open-source alternative to LLM training datasets. It amalgamates data from diverse sources including CommonCrawl, C4, GitHub, Wikipedia, Gutenberg, Books3, ArXiv, and StackExchange. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper evaluates on zero-shot common sense reasoning tasks and perplexity on C4 and WikiText2, but it does not specify the train/validation/test splits for the RedPajama-simple-1B dataset used for training, nor does it provide a splitting methodology for the datasets used in evaluation.
Hardware Specification | No | The paper mentions memory constraints on devices or single GPUs/servers as a motivation for their work but does not specify the hardware (e.g., GPU models, CPU types) used for running their experiments.
Software Dependencies | No | The paper mentions the AdamW optimizer and uses standard libraries implicitly, but it does not specify any software dependencies with version numbers (e.g., PyTorch version, Python version).
Experiment Setup | Yes | The optimization of the model is facilitated through the AdamW optimizer [Loshchilov and Hutter, 2017], applied with zero weight decay. We assign a batch size of 1 to each GPU and implement a learning rate of 2e-5, adhering to a cosine learning rate decay strategy. We only fine-tune our PB-LLM for 10K iterations. (A training-loop sketch with these hyperparameters follows the table.)
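
To make the partial-binarization idea quoted under Research Type more concrete, below is a minimal PyTorch sketch, not the authors' implementation: it keeps a fraction of salient weights in full precision and binarizes the rest with a scaling factor. The function name partially_binarize, the salient_frac parameter, and the magnitude-based salience criterion are assumptions made for illustration.

```python
# Illustrative sketch (not the authors' code): keep the most salient weights
# (by magnitude, an assumption here) in full precision and binarize the rest
# with a per-tensor scaling factor.
import torch

def partially_binarize(weight: torch.Tensor, salient_frac: float = 0.1) -> torch.Tensor:
    """Return a partially binarized copy of `weight`.

    `salient_frac` (hypothetical parameter) is the fraction of weights kept
    in full precision.
    """
    flat = weight.abs().flatten()
    k = max(1, int(salient_frac * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    salient_mask = weight.abs() >= threshold

    # Binarize the remaining weights: sign(w) scaled by the mean absolute
    # value of the non-salient part (a common binarization scaling choice).
    non_salient = weight[~salient_mask]
    alpha = non_salient.abs().mean() if non_salient.numel() > 0 else weight.abs().mean()
    binarized = torch.sign(weight) * alpha

    return torch.where(salient_mask, weight, binarized)

# Example: binarize ~90% of a random weight matrix, keeping 10% salient weights.
w = torch.randn(1024, 1024)
w_pb = partially_binarize(w, salient_frac=0.1)
```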
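The Open Datasets row quotes the paper's use of a RedPajama sample for training. The sketch below shows one way to load an openly available RedPajama sample with the Hugging Face datasets library; the dataset id togethercomputer/RedPajama-Data-1T-Sample is an assumption and is not taken from the paper, which only names RedPajama-simple-1B.

```python
# Sketch only: load an openly available RedPajama sample with Hugging Face
# `datasets`. The dataset id below is an assumption, not from the paper.
from datasets import load_dataset

redpajama = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")
print(redpajama[0]["text"][:200])  # inspect one training document
```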
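The Experiment Setup row reports AdamW with zero weight decay, a learning rate of 2e-5 with cosine decay, batch size 1 per GPU, and 10K fine-tuning iterations. The following is a minimal sketch of that configuration in PyTorch; the tiny stand-in model, dummy data, and dummy loss are placeholders, not the paper's LLaMA fine-tuning pipeline.

```python
# Minimal sketch of the reported hyperparameters: AdamW, zero weight decay,
# lr 2e-5, cosine decay over 10K iterations, batch size 1 per GPU.
# The model, data, and loss below are placeholders for illustration only.
import torch
import torch.nn as nn

model = nn.Linear(128, 128)                       # stand-in for the quantized LLM
data = [torch.randn(1, 128) for _ in range(100)]  # batch size 1 per "GPU"
max_iters = 10_000

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_iters)

for step in range(min(max_iters, len(data))):
    x = data[step]
    loss = model(x).pow(2).mean()  # dummy objective; the paper's loss differs
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```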