LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models

Authors: Yixiao Li, Yifan Yu, Chen Liang, Nikos Karampatziakis, Pengcheng He, Weizhu Chen, Tuo Zhao

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on natural language understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and outperforms existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed precision regimes.
Researcher Affiliation | Collaboration | Li, Yu, Liang, and Zhao are affiliated with Georgia Institute of Technology. Correspondence to yixiaoli@gatech.edu, yyu429@gatech.edu and tourzhao@gatech.edu. He, Karampatziakis, and Chen are affiliated with Microsoft Azure.
Pseudocode | Yes | Algorithm 1 LoftQ (a hedged sketch of the alternating quantization-and-low-rank-approximation steps appears after this table)
Open Source Code | Yes | The code is available on https://github.com/yxli2123/LoftQ.
Open Datasets | Yes | We evaluate our method on NLU and NLG tasks. We apply LoftQ for quantizing DeBERTaV3-base (He et al., 2021b), BART-large (Lewis et al., 2019), and LLAMA-2 series (Touvron et al., 2023). Models and Datasets. We quantize the DeBERTaV3-base (He et al., 2021b) with LoftQ, then finetune and evaluate the model on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019), SQuADv1.1 (Rajpurkar et al., 2016), and ANLI (Nie et al., 2019)... We quantize BART-large model (Lewis et al., 2020) with LoftQ, then finetune and evaluate the model on two commonly used summarization datasets: XSum (Narayan et al., 2018) and CNN/Daily Mail (Hermann et al., 2015)... We quantize LLAMA-2-7b and LLAMA-2-13b (Touvron et al., 2023) with LoftQ. We then fine-tune and evaluate the models on two NLG datasets: GSM8K (Cobbe et al., 2021) and WikiText-2 (Merity et al., 2016). (An illustrative dataset-loading snippet appears after this table.)
Dataset Splits | Yes | We evaluate the model on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019), SQuADv1.1 (Rajpurkar et al., 2016), and ANLI (Nie et al., 2019). The specific tasks of GLUE are given in Appendix C. Following previous works (Zhang et al., 2023), we exclude WNLI in the experiments. ... Table 1 and Table 2 summarize the results for 2-bit quantization on the GLUE, SQuADv1.1, and ANLI datasets... We use batch size of 32 for all GLUE tasks and ANLI. We use batch size of 16 for SQuADv1.1. We use LoftQ of 5 iterations for all GLUE tasks.
Hardware Specification | Yes | All the experiments are conducted on NVIDIA A100 GPUs.
Software Dependencies | No | The paper states 'Our implementation is based on publicly available Huggingface Transformers code-base' and cites 'Pytorch', but it does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We select the learning rates from {1e-5, 5e-5, 1e-4, 5e-4}. We use batch size of 32 for all GLUE tasks and ANLI. We use batch size of 16 for SQuADv1.1. We use LoftQ of 5 iterations for all GLUE tasks. We train 2 epochs on WikiText-2 and 6 epochs on GSM8K. We select learning rate from {1e-5, 5e-5, 7e-5, 1e-4, 3e-4, 4e-4}. Specific settings are summarized in Table 16 and Table 17. (An illustrative training configuration appears after this table.)
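
For reference, below is a minimal sketch of the alternating procedure reported as Algorithm 1 (LoftQ): repeatedly quantize the residual W - AB^T, then refresh the low-rank factors from the rank-r SVD of W - Q, so that Q + AB^T approximates the full-precision weight. The function names and the min-max `uniform_quantize` helper are illustrative assumptions; the paper uses NF4/NF2 (NormalFloat) quantization, not this placeholder quantizer.

```python
# Minimal sketch of the alternating optimization behind Algorithm 1 (LoftQ).
# The uniform min-max quantizer below is a stand-in for the paper's NF4/NF2
# quantization functions; names here are illustrative, not from the release.
import numpy as np

def uniform_quantize(w, num_bits=2):
    """Placeholder quantizer: uniform min-max quantization of a weight matrix."""
    levels = 2 ** num_bits
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (levels - 1)
    return np.round((w - lo) / scale) * scale + lo

def loftq_init(W, rank=16, num_bits=2, iters=5):
    """Alternate between quantizing the residual W - A B^T and taking the
    rank-r SVD of W - Q, so that Q + A B^T approximates W."""
    A = np.zeros((W.shape[0], rank))
    B = np.zeros((W.shape[1], rank))
    for _ in range(iters):
        Q = uniform_quantize(W - A @ B.T, num_bits)        # quantized backbone
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]                         # low-rank adapter factors
        B = Vt[:rank].T
    return Q, A, B

# Example: 2-bit backbone plus rank-16 adapters for one layer-sized matrix.
W = np.random.randn(768, 768).astype(np.float32)
Q, A, B = loftq_init(W, rank=16, num_bits=2, iters=5)
print(np.linalg.norm(W - (Q + A @ B.T)))  # approximation error of the joint init
```

The point of the alternation is that the quantized backbone and the LoRA initialization are fitted to each other, rather than quantizing W once and starting the adapters at zero.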
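
The open datasets listed in the table are all distributed on the Hugging Face Hub; the snippet below is an illustrative way to load them with the `datasets` library, assuming their standard Hub identifiers (the paper does not state how the data were obtained).

```python
# Illustrative only: loading the evaluation datasets named above,
# assuming the standard Hugging Face Hub identifiers.
from datasets import load_dataset

glue_mnli = load_dataset("glue", "mnli")               # one of the GLUE tasks
squad = load_dataset("squad")                          # SQuAD v1.1
anli = load_dataset("anli")
xsum = load_dataset("xsum")
cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
gsm8k = load_dataset("gsm8k", "main")
wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")

# Inspect the provided splits for two of the datasets.
print(list(squad.keys()), list(gsm8k.keys()))
```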
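
For illustration, the quoted experiment setup could be expressed as Hugging Face `TrainingArguments` as sketched below; the output paths and the particular learning rates picked from each sweep are assumptions, not values reported in the paper.

```python
# Illustrative sketch: wiring the quoted hyperparameters into TrainingArguments.
from transformers import TrainingArguments

# DeBERTaV3-base on GLUE / ANLI: batch size 32, LR swept over {1e-5, 5e-5, 1e-4, 5e-4}.
glue_args = TrainingArguments(
    output_dir="loftq-deberta-glue",      # placeholder path (assumption)
    per_device_train_batch_size=32,
    learning_rate=5e-5,                   # assumed pick from the quoted sweep
)

# LLAMA-2 on GSM8K: 6 epochs, LR swept over {1e-5, 5e-5, 7e-5, 1e-4, 3e-4, 4e-4}.
gsm8k_args = TrainingArguments(
    output_dir="loftq-llama2-gsm8k",      # placeholder path (assumption)
    num_train_epochs=6,
    learning_rate=3e-4,                   # assumed pick from the quoted sweep
)
```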