Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits

Authors: Zikai Zhou, Qizheng Zhang, Hermann Kumbong, Kunle Olukotun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance-precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments. We extensively evaluate LowRA across 4 LLMs and 4 tasks, benchmarking against state-of-the-art baselines.
Researcher Affiliation | Academia | Department of Computer Science, Stanford University, Stanford, USA. Correspondence to: Zikai Zhou <EMAIL>.
Pseudocode | Yes | Algorithm 1: Channelwise Precision Assignment
Open Source Code | No | Open-Source Release: We will open-source LowRA upon publication to foster further research in ultra-low-bit LoRA fine-tuning.
Open Datasets | Yes | We use standard datasets across different NLP tasks: WikiText-2 (Merity et al., 2016) (language modeling, perplexity), OpenAssistant (Köpf et al., 2024) (multi-turn conversation, perplexity), XSUM (Narayan et al., 2018) (summarization, ROUGE scores), and CNN/Daily Mail (Hermann et al., 2015) (summarization, ROUGE scores).
Dataset Splits | No | Each dataset is evaluated using the standard metrics used in prior work. For fine-tuning, we follow QLoRA's setup of using a batch size of 1 and sequence length of 512.
Hardware Specification | Yes | Hardware Platform: Experiments are conducted on NVIDIA A100 GPUs (80 GB memory). Each LLaMA experiment runs on a single dedicated GPU. Each BART-large experiment runs two instances concurrently on a single GPU. Across LLaMA-7B on an RTX 3080, 1.5-bit LowRA delivers a 3.42× throughput increase over QLoRA (32.16 vs. 9.40 tokens per second). On LLaMA-13B with an RTX A4000, the same 1.5-bit configuration still yields a 1.39× speed-up. We benchmarked LowRA at multiple bit-widths against QLoRA (4 bit) on an RTX A5000 (24 GB) and an A100 (80 GB).
Software Dependencies | No | We build our two-level ILP pipeline using the open-sourced COIN-OR Branch and Cut (CBC) (Saltzman, 2002) solver via the Python-based modeling library PuLP (Mitchell et al., 2011). We integrate this into the bitsandbytes library for usability.
Experiment Setup | Yes | Hyperparameters: For a fair comparison, we use identical hyperparameters across all methods, consistent with QLoRA (Dettmers et al., 2024) and LoftQ (Li et al., 2023). Details on selected hyperparameters are in Appendix I. Appendix I provides Table 10 (Hyperparameters used for all LLaMA experiments on Wikitext-2), Table 11 (Hyperparameters for fine-tuning Bart-Large on CNN/Daily Mail), Table 12 (Hyperparameters for fine-tuning Bart-Large on XSUM), and Table 13 (Hyperparameters used for all Llama experiments on Open Assistant (oasst1)), all listing specific values like 'lora r 64', 'lora alpha 64', 'learning rate 0.0003', 'per device train batch size 16', 'gradient accumulation steps 4', 'max steps 126', 'warmup ratio 0.03', etc.
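
The figures quoted in the Hardware Specification and Experiment Setup rows above can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the variable names are hypothetical, it assumes the quoted hyperparameters belong to the same run, it assumes one GPU per run (as stated for the LLaMA experiments), and it assumes warmup steps are rounded up.

```python
import math

# Throughput figures quoted in the Hardware Specification row (tokens per second).
lowra_tps = 32.16  # 1.5-bit LowRA, LLaMA-7B on an RTX 3080
qlora_tps = 9.40   # QLoRA baseline in the same comparison
speedup = lowra_tps / qlora_tps

# Hyperparameter values quoted in the Experiment Setup row.
# Assumption: these values apply together to a single run.
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
max_steps = 126
warmup_ratio = 0.03
num_devices = 1  # assumption: one GPU per run

# Sequences consumed per optimizer step, and over the whole run.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
sequences_seen = effective_batch * max_steps

# Assumption: the warmup length is the ratio times total steps, rounded up.
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(f"speed-up over QLoRA: {speedup:.2f}x")
print(f"effective batch size: {effective_batch}")
print(f"sequences seen in {max_steps} steps: {sequences_seen}")
print(f"warmup steps: {warmup_steps}")
```

The computed ratio agrees with the 3.42× throughput increase reported in the table. Note that the Dataset Splits row quotes a batch size of 1 while the Experiment Setup row quotes a per-device batch size of 16, so the effective batch size here should be read as one possible configuration rather than a confirmed setting.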