PB-LLM: Partially Binarized Large Language Models
Authors: Zhihang Yuan, Yuzhang Shang, Zhen Dong
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we explore network binarization specifically for LLM quantization and propose Partially-binarized LLMs (abbreviated as PB-LLM). Specifically, following the binarization benchmark in BiBench [Qin et al., 2023], we generalize some representative binarization methods into LLM quantization scenarios. The results evaluated on seven zero-shot common sense reasoning tasks are shown in Fig. 2. (A minimal partial-binarization sketch follows the table.) |
| Researcher Affiliation | Collaboration | Zhihang Yuan (Houmo AI); Zhen Dong (UC Berkeley) |
| Pseudocode | No | The paper describes mathematical formulations and processes but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | The code is available at PB-LLM. Codes can be found anonymously in PB-LLM. |
| Open Datasets | Yes | In this study, the PB-LLM is trained using the RedPajama-simple-1B dataset, as the dataset for LLaMA training is not openly accessible. This dataset, RedPajama-1T, is structured to closely resemble the dataset described in the LLaMA paper and serves as a transparent, open-source alternative to the LLaMA training dataset. It amalgamates data from diverse sources including CommonCrawl, C4, GitHub, Wikipedia, Gutenberg, Books3, ArXiv, and StackExchange. |
| Dataset Splits | No | The paper evaluates on zero-shot common sense reasoning tasks and perplexity on C4 and WikiText2, but it does not specify the train/validation/test splits for the Red Pajama-simple-1B dataset used for training, nor does it provide a splitting methodology for the datasets used in evaluation. |
| Hardware Specification | No | The paper mentions memory constraints on devices or single GPUs/servers as a motivation for their work but does not specify the hardware (e.g., GPU models, CPU types) used for running their experiments. |
| Software Dependencies | No | The paper mentions the AdamW optimizer and uses standard libraries implicitly, but it does not specify any software dependencies with version numbers (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | The optimization of the model is facilitated through the AdamW optimizer [Loshchilov and Hutter, 2017], applied with zero weight decay. We assign a batch size of 1 to each GPU and implement a learning rate of 2e-5, adhering to a cosine learning rate decay strategy. We only fine-tune our PB-LLM for 10K iterations. (A configuration sketch follows the table.) |
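
As referenced from the Research Type row, the sketch below illustrates what partial weight binarization of this kind can look like: most entries of a weight matrix are reduced to a scaled sign, while a small "salient" fraction is left in full precision. The magnitude-based salience criterion, the `salient_ratio` value, and the per-row scaling used here are illustrative assumptions for this report, not the paper's exact algorithm.

```python
# Minimal sketch of partial weight binarization (illustrative, not PB-LLM's exact method).
# Assumption: salient weights are chosen by magnitude and kept in full precision;
# the remaining weights are binarized to a scaled sign.
import torch


def partially_binarize(weight: torch.Tensor, salient_ratio: float = 0.1) -> torch.Tensor:
    """Binarize all but the top `salient_ratio` largest-magnitude entries (hypothetical criterion)."""
    flat = weight.abs().flatten()
    k = max(1, int(salient_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    salient_mask = weight.abs() >= threshold      # weights kept in full precision
    binary_mask = ~salient_mask                   # weights to be binarized

    # Per-row scale alpha = mean |w| over the binarized entries (standard sign-based binarization).
    abs_binarized = weight.abs() * binary_mask
    counts = binary_mask.sum(dim=1, keepdim=True).clamp(min=1)
    alpha = abs_binarized.sum(dim=1, keepdim=True) / counts

    return torch.where(salient_mask, weight, alpha * torch.sign(weight))


# Example: binarize ~90% of a random layer's weights, keep the 10% most salient ones.
w = torch.randn(256, 512)
w_q = partially_binarize(w, salient_ratio=0.1)
```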
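
The second sketch matches the Experiment Setup row: AdamW with zero weight decay, a 2e-5 learning rate under cosine decay, and a 10K-iteration fine-tuning budget. The helper name `build_finetuning_setup`, the placeholder `model`, and the commented training loop are assumptions for illustration; only the hyperparameters come from the paper.

```python
# Configuration sketch of the reported fine-tuning setup: AdamW, zero weight decay,
# lr 2e-5 with cosine decay, batch size 1 per GPU, 10K iterations.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def build_finetuning_setup(model: torch.nn.Module, total_iters: int = 10_000):
    """Return the optimizer and LR scheduler with the hyperparameters quoted above."""
    optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_iters)
    return optimizer, scheduler


# Usage (assuming a standard per-iteration loop with batch size 1 per GPU):
# optimizer, scheduler = build_finetuning_setup(model)
# for step, batch in zip(range(10_000), train_loader):
#     loss = model(**batch).loss
#     loss.backward()
#     optimizer.step(); scheduler.step(); optimizer.zero_grad()
```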