QLoRA: Efficient Finetuning of Quantized LLMs
Authors: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We present QLORA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLORA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. We use QLORA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models)." |
| Researcher Affiliation | Academia | University of Washington {dettmers,artidoro,ahai,lsz}@cs.washington.edu |
| Pseudocode | No | "Using the components described above, we define QLORA for a single linear layer in the quantized base model with a single LoRA adapter as follows: $\mathbf{Y}^{\text{BF16}} = \mathbf{X}^{\text{BF16}}\,\text{doubleDequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, \mathbf{W}^{\text{NF4}}) + \mathbf{X}^{\text{BF16}}\mathbf{L}_1^{\text{BF16}}\mathbf{L}_2^{\text{BF16}}$ (5); $\text{doubleDequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, \mathbf{W}^{\text{k-bit}}) = \text{dequant}(\text{dequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}), \mathbf{W}^{\text{4bit}}) = \mathbf{W}^{\text{BF16}}$ (6)." The paper provides mathematical formulas but not structured pseudocode or algorithm blocks (a conceptual sketch of these equations follows the table). |
| Open Source Code | Yes | "We release all of our models and code, including CUDA kernels for 4-bit training." and "We open-source our codebase and CUDA kernels and integrate our methods into the Hugging Face transformers stack [65], making them easily accessible to all." Footnote 2 provides links: https://github.com/artidoro/qlora and https://github.com/TimDettmers/bitsandbytes (a hedged loading example built on this stack follows the table). |
| Open Datasets | Yes | "Data: As, to our knowledge, there is no comprehensive study of instruction-following datasets, we select eight recent datasets. We include datasets obtained through crowd-sourcing (OASST1 [31], HH-RLHF [4]), distillation from instruction-tuned models (Alpaca [56], self-instruct [60], unnatural-instructions [26]), corpora aggregations (FLAN v2 [12]), as well as hybrids (Chip2 [32], Longform [30])." |
| Dataset Splits | Yes | "We use the MMLU 5-shot dev set for validation and hyperparameter tuning." and "However, we split the training data in training and validation datasets allowing us to perform more rigorous hyperparameter tuning and early stopping." |
| Hardware Specification | Yes | "We present QLORA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU", "Our best model family... only requiring 24 hours of finetuning on a single GPU.", "paged optimizers are critical to do 33B/65B QLORA tuning on a single 24/48GB GPU", and "Figure 3: Speedups of NF4 inference for batch size 1 compared to 16-bit inference for different GPUs. We see that RTX 3090/4090 and A40 GPUs have large speedups of 2.9-4.0x". |
| Software Dependencies | No | "We open-source our codebase and CUDA kernels and integrate our methods into the Hugging Face transformers stack [65], making them easily accessible to all." The paper mentions the "Hugging Face transformers stack" and "CUDA kernels" but does not provide specific version numbers for these or other software dependencies required for replication. |
| Experiment Setup | Yes | "We set LoRA r = 64, α = 16, and add LoRA modules on all linear layers of the base model. We also use Adam beta2 of 0.999, max grad norm of 0.3 and LoRA dropout of 0.1 for models up to 13B and 0.05 for 33B and 65B models. Following previous work on instruction finetuning [63, 61] and after benchmarking other linear and cosine schedules, we use a constant learning rate schedule. We use group-by-length to group examples of similar lengths in the same batch (note this will produce an oscillating loss curve). The hyperparameters we tune for each model size are shown in Table 8." Table 8 provides concrete values for batch size, learning rate, steps, source length, and target length (a hedged configuration sketch follows the table). |
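
The Pseudocode row quotes Equations 5 and 6 but the paper ships no algorithm block, so the following is a minimal conceptual sketch of the QLORA forward pass for a single linear layer, written in plain PyTorch rather than the released CUDA kernels. The NF4 codebook lookup, the blocksize, and the purely multiplicative second-level dequantization of the absmax constants are simplifying assumptions made here for illustration.

```python
import torch

def double_dequant(c1_fp32, c2_q, nf4_codebook, w_codes, blocksize=64):
    """Eq. 6 (sketch): recover W^BF16 from 4-bit codes and two levels of constants.

    c2_q holds quantized per-block absmax constants (one per `blocksize` weights);
    c1_fp32 is the second-level constant used to dequantize them. The real
    bitsandbytes kernel also centers the constants with an offset, omitted here.
    """
    absmax = c2_q.to(torch.float32) * c1_fp32             # dequant(c1, c2)
    w = nf4_codebook[w_codes.long()]                       # map 4-bit codes to NF4 values
    w = w.reshape(-1, blocksize) * absmax.reshape(-1, 1)   # per-block rescale
    return w.reshape(w_codes.shape).to(torch.bfloat16)     # W^BF16

def qlora_linear(x_bf16, w_bf16, l1_bf16, l2_bf16):
    """Eq. 5 (sketch): Y = X W + X L1 L2, frozen dequantized base weight plus LoRA path.

    Assumed shapes: x (batch, d_in), w (d_out, d_in), l1 (d_in, r), l2 (r, d_out).
    Only l1 and l2 receive gradients; w stays frozen in its quantized storage.
    """
    return x_bf16 @ w_bf16.T + (x_bf16 @ l1_bf16) @ l2_bf16
```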
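
The Open Source Code row points to the released qlora and bitsandbytes repositories and to the Hugging Face transformers integration. Assuming a recent transformers + bitsandbytes install, a CUDA GPU, and a checkpoint name chosen only as a placeholder, loading a base model in the 4-bit NF4, double-quantized, BF16-compute configuration the paper describes looks roughly like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 storage, double quantization of the constants, BF16 compute dtype,
# mirroring the quantization setup described in the paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "huggyllama/llama-7b"  # placeholder checkpoint, not prescribed by the paper
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # single-GPU placement, matching the reported setting
)
```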
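
The Experiment Setup row lists the LoRA rank, α, dropout, Adam β2, max gradient norm, constant schedule, and length grouping. A hedged peft/transformers translation of those values is sketched below: the batch size, learning rate, and step count are placeholders standing in for the per-model-size values of Table 8, the target-module list is an assumed LLaMA-style naming for "all linear layers", and `paged_adamw_32bit` is the transformers-side name used here for the paged optimizers the paper describes.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments

# LoRA on all linear layers, r = 64, alpha = 16, dropout 0.1
# (the paper uses dropout 0.05 for the 33B/65B models).
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[  # assumed LLaMA-style module names
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# `model` is the 4-bit base model from the loading sketch above.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="qlora-finetune",     # placeholder
    per_device_train_batch_size=16,  # placeholder; per-size value in Table 8
    learning_rate=2e-4,              # placeholder; per-size value in Table 8
    max_steps=1000,                  # placeholder; per-size value in Table 8
    lr_scheduler_type="constant",    # constant learning rate schedule, as in the paper
    max_grad_norm=0.3,
    adam_beta2=0.999,
    group_by_length=True,            # length grouping; an oscillating loss curve is expected
    bf16=True,
    optim="paged_adamw_32bit",       # paged optimizer for memory spikes during checkpointing
)
```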