QBB: Quantization with Binary Bases for LLMs
Authors: Adrian Bulat, Yassine Ouali, Georgios Tzimiropoulos
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When evaluated across multiple LLM families, our approach matches and outperforms all prior works, setting a new state-of-the-art result using a summation-only based approach. |
| Researcher Affiliation | Collaboration | Adrian Bulat (Samsung AI Cambridge; Technical University of Iasi), Yassine Ouali (Samsung AI Cambridge), Georgios Tzimiropoulos (Samsung AI Cambridge; Queen Mary University of London) |
| Pseudocode | No | The paper illustrates processes with figures (Fig. 1, Fig. 2) but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | No code was included with the paper at submission time. |
| Open Datasets | Yes | We compare our approach with the current state-of-the-art for low-bit quantization in terms of perplexity score on the main benchmark for quantization, WikiText2 [41], focusing mainly on the LLaMA-2 [53] {7, 13, 70}B family of models. However, we also include results for LLaMA [52] {7, 13, 30, 65}B and Phi-2 [23] 2.7B models. |
| Dataset Splits | No | The paper mentions using WikiText2 for evaluation but does not explicitly provide the training, validation, and test splits used in its experiments, nor does it cite a source that defines those exact splits. |
| Hardware Specification | Yes | During the input-agnostic quantization part, presented in Sec. 3.1 and using a single A100 GPU, we optimize each set of binary matrices and scaling vectors, layer by layer... |
| Software Dependencies | No | The paper states 'We implement our method using PyTorch [43]' but does not provide a specific version number for PyTorch or other software dependencies. |
| Experiment Setup | Yes | During the input-agnostic quantization part... using the following hyperparameters: Adam optimizer [28], 15000 iterations, no weight decay, an initial learning rate of 1e-4 decayed to 0 using a cosine scheduler. For the data-free distillation step... we fine-tune the scaling vectors only for 2 epochs using an Adam optimizer, a cosine learning rate scheduler, no weight decay, and an initial learning rate set to 2.5e-4. For added stability, we clip the gradients with a norm higher than 1. (Hedged sketches of both steps follow below the table.) |
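
The layer-by-layer, input-agnostic quantization step described above (binary matrices plus scaling vectors optimized with Adam, 15000 iterations, no weight decay, initial learning rate 1e-4 cosine-decayed to 0) can be sketched roughly as follows. This is a minimal PyTorch sketch, not the authors' implementation: the number of binary bases `K`, the per-output-channel shape of the scaling vectors, and the straight-through sign parameterization are assumptions made for illustration.

```python
# Hypothetical sketch: quantize one weight matrix W into K binary bases with
# per-output-channel scaling vectors, using the hyperparameters reported in the paper.
import torch

def quantize_layer(W: torch.Tensor, K: int = 3, iters: int = 15000, lr: float = 1e-4):
    out_dim, in_dim = W.shape
    # Latent real-valued tensors whose sign gives the binary bases B_k in {-1, +1}
    latents = torch.nn.Parameter(W.detach().repeat(K, 1, 1)
                                 + 1e-3 * torch.randn(K, out_dim, in_dim))
    # Per-output-channel scaling vectors s_k, one per binary basis (shape is an assumption)
    scales = torch.nn.Parameter(W.abs().mean(dim=1, keepdim=True).repeat(K, 1, 1) / K)

    opt = torch.optim.Adam([latents, scales], lr=lr, weight_decay=0.0)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=iters, eta_min=0.0)

    for _ in range(iters):
        # Straight-through estimator: forward pass uses sign(), backward passes
        # gradients through the latent real-valued tensors
        B = torch.sign(latents) + (latents - latents.detach())
        W_hat = (scales * B).sum(dim=0)  # reconstruction: sum_k s_k * B_k
        loss = torch.nn.functional.mse_loss(W_hat, W)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()

    return torch.sign(latents).detach(), scales.detach()
```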
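A similarly hedged sketch of the data-free distillation step, in which only the scaling vectors are fine-tuned for 2 epochs with Adam, a cosine schedule, no weight decay, an initial learning rate of 2.5e-4, and gradient clipping at norm 1. The KL-divergence loss against the full-precision teacher, the `"scale"` parameter-name convention, and the Hugging Face-style `model(**batch).logits` interface are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the data-free distillation step: only scaling vectors train,
# gradients are clipped to norm 1, cosine-decayed Adam at 2.5e-4 for 2 epochs.
import torch

def finetune_scales(student, teacher, loader, epochs: int = 2, lr: float = 2.5e-4):
    # Freeze everything except the per-channel scaling vectors of the quantized layers
    scale_params = [p for n, p in student.named_parameters() if "scale" in n]
    for p in student.parameters():
        p.requires_grad_(False)
    for p in scale_params:
        p.requires_grad_(True)

    opt = torch.optim.Adam(scale_params, lr=lr, weight_decay=0.0)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * len(loader))

    for _ in range(epochs):
        for batch in loader:  # batch: dict of tokenized inputs (assumption)
            with torch.no_grad():
                teacher_logits = teacher(**batch).logits
            student_logits = student(**batch).logits
            loss = torch.nn.functional.kl_div(
                torch.log_softmax(student_logits, dim=-1),
                torch.softmax(teacher_logits, dim=-1),
                reduction="batchmean",
            )
            opt.zero_grad()
            loss.backward()
            # Clip gradients whose norm exceeds 1, as reported in the setup
            torch.nn.utils.clip_grad_norm_(scale_params, max_norm=1.0)
            opt.step()
            sched.step()
```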