Accurate LoRA-Finetuning Quantization of LLMs via Information Retention
Authors: Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments show that IR-QLoRA can significantly improve accuracy across LLaMA and LLaMA2 families under 2-4 bit-widths, e.g., 4-bit LLaMA-7B achieves 1.4% improvement on MMLU compared with the state-of-the-art methods. |
| Researcher Affiliation | Collaboration | 1ETH Zürich 2Beihang University 3Bytedance AI Lab. |
| Pseudocode | Yes | Algorithm 1 The weight search process within each block in IR-QLoRA |
| Open Source Code | Yes | The code is available at https://github.com/htqin/ir-qlora. |
| Open Datasets | Yes | Our IR-QLoRA is established upon the LLaMA (Touvron et al., 2023a) and LLaMA2 (Touvron et al., 2023b) families... and constructs parameter-efficient finetuning on Alpaca (Taori et al., 2023) and Flan v2 (Longpre et al., 2023) datasets. |
| Dataset Splits | No | The paper states that Alpaca and Flan v2 datasets were used for finetuning, and MMLU and CommonsenseQA benchmarks for evaluation, but does not explicitly provide training/validation/test splits for the finetuning datasets themselves. |
| Hardware Specification | Yes | All our experiments are conducted on Nvidia Tesla A100 GPUs. |
| Software Dependencies | No | The paper mentions using an optimizer (paged AdamW) and specifies hyperparameters, but does not provide specific version numbers for software dependencies such as deep learning frameworks or libraries. |
| Experiment Setup | Yes | Following (Dettmers et al., 2023), we apply the double quantization mechanism, and set the block size to 64 for quantization and 256 for double quantization. Regarding LoRA parameters, we set r = 64, α = 16, and LoRA dropout of 0.1 for models up to 13B and 0.05 for 33B and 65B models. We employ the paged AdamW optimizer with a beta2 value of 0.999, and a learning rate of 2e-4 for models up to 13B and 1e-4 for 33B and 65B models, limiting the maximum gradient norm to 0.3 and adopting a constant learning rate strategy. Fine-tuning was executed for 10,000 and 20,000 steps on the Alpaca and FLAN v2 datasets, respectively, utilizing a batch size of 16. |
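
The reported hyperparameters split by model size (learning rate and LoRA dropout differ for models up to 13B versus the 33B/65B models). The selection logic can be sketched as follows; the function name and dictionary layout are illustrative, not from the paper, but the values are taken from the experiment-setup quote above.

```python
def finetune_hparams(model_size_b: int) -> dict:
    """Return the IR-QLoRA finetuning hyperparameters reported in the
    paper for a given LLaMA model size (in billions of parameters).

    The function name and dict keys are illustrative; the values come
    from the experiment-setup description quoted above.
    """
    small = model_size_b <= 13  # models up to 13B vs. 33B/65B models
    return {
        "lora_r": 64,
        "lora_alpha": 16,
        "lora_dropout": 0.1 if small else 0.05,
        "optimizer": "paged_adamw",
        "adam_beta2": 0.999,
        "learning_rate": 2e-4 if small else 1e-4,
        "max_grad_norm": 0.3,
        "lr_schedule": "constant",
        "batch_size": 16,
        "quant_block_size": 64,          # block size for quantization
        "double_quant_block_size": 256,  # block size for double quantization
    }
```

For example, `finetune_hparams(7)` yields the 7B settings (learning rate 2e-4, LoRA dropout 0.1), while `finetune_hparams(65)` yields the large-model settings (learning rate 1e-4, LoRA dropout 0.05).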