Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic

Authors: Kanghyun Choi, Hyeyoon Lee, Sunjong Park, Dain Kwon, Jinho Lee

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental evaluations demonstrate that FALQON achieves approximately a 3× training speedup over existing quantized LoRA methods with a similar level of accuracy, providing a practical solution for efficient large-scale model fine-tuning. Through extensive experiments on various tasks, we demonstrate that FALQON achieves up to 3× faster fine-tuning compared to quantized LoRA baselines, while maintaining comparable accuracy.
Researcher Affiliation Academia Department of Electrical and Computer Engineering, Seoul National University
Pseudocode Yes Algorithm 1: FALQON: Initialization; Algorithm 2: FALQON: Gradient Computation; Algorithm 3: FALQON: Backward and Update; Algorithm 4: Overall Framework of FALQON
Open Source Code Yes Code is available at https://github.com/iamkanghyunchoi/falqon. We provide the implementation of FALQON in a public GitHub repository: https://github.com/iamkanghyunchoi/falqon.
Open Datasets Yes We fine-tune LLaMA-7B and 13B on the Alpaca [42] and OASST1 [23] datasets, then evaluate on the Massively Multitask Language Understanding (MMLU [19]) benchmark, which measures knowledge and reasoning across a diverse set of domains.
Dataset Splits Yes We assess our models on HellaSwag [53], PIQA [3], WinoGrande [36], ARC [10], BoolQ [9], and OpenBookQA [30] for commonsense QA, following QA-LoRA's five-shot evaluation protocol via the lm-eval-harness framework.
Hardware Specification Yes All computational cost evaluations (Figure 3 and Tables 1a, 1b and 2) and detailed breakdown analyses (Figures 1a, 1b and 4) are conducted on a dedicated node of a local computing cluster. The computing node contains two Intel Xeon Gold 6442Y CPUs (total of 48 physical cores, 96 threads) running at 2.60 GHz, equipped with 1TB DDR5 ECC memory. The node has NVIDIA GeForce RTX 4090 GPUs (24GB VRAM) with NVIDIA driver version 550.54.15 and CUDA 12.4, running on Ubuntu 22.04.4 LTS with Linux kernel 5.15.0-94-generic. Unless otherwise noted, all experiments utilize a single RTX 4090 GPU from this node.
Software Dependencies Yes NVIDIA driver version 550.54.15 and CUDA 12.4, running on Ubuntu 22.04.4 LTS with Linux kernel 5.15.0-94-generic.
Experiment Setup Yes We follow the settings of QLoRA: a Paged AdamW optimizer with a batch size of 16, learning rate of 2e-5, and 1,875 training steps. Refer to the Appendix for the detailed settings. [...] In the quantized-LoRA baselines, we adopt weight-only quantization: NF4 for QLoRA and IR-QLoRA, and INT4 for QA-LoRA. For FP6-LLM, we use weight-only FP6-E3M2 and FP6-E2M3 quantization. For FP8-based methods (Fishman et al., TorchAO, and FALQON), we use FP8-E4M3 for weights and activations, and FP8-E5M2 for gradients. We follow the settings of QLoRA: a Paged AdamW optimizer with batch size of 16, max gradient norm of 0.3, learning rate of 2e-4, k=10, and 1,875 training steps.
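
Illustrative note: the quoted experiment setup can be collected into a small Python sketch to make the numbers easier to check. This is not taken from the FALQON repository. The class name `FineTuneSetup`, its field names, and the helper `fp8_max` are hypothetical. The quoted text gives the learning rate as both 2e-5 and 2e-4; the sketch defaults to 2e-4 only as a placeholder. It also assumes that FP8-E4M3 refers to the common "finite-only" variant (top exponent code usable, one NaN pattern) and that FP8-E5M2 follows IEEE-style conventions, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSetup:
    """Hypothetical container for the quoted settings (field names are illustrative)."""
    optimizer: str = "paged_adamw"
    batch_size: int = 16
    max_grad_norm: float = 0.3
    learning_rate: float = 2e-4  # placeholder: the quoted text also states 2e-5
    train_steps: int = 1875
    weight_format: str = "FP8-E4M3"
    activation_format: str = "FP8-E4M3"
    gradient_format: str = "FP8-E5M2"

    def samples_seen(self) -> int:
        """Number of training samples processed, assuming no gradient accumulation."""
        return self.batch_size * self.train_steps


def fp8_max(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of a small float format with the given bit widths.

    ieee_like=True  : top exponent code is reserved for inf/NaN (assumed for E5M2).
    ieee_like=False : top exponent code is usable and only the all-ones
                      mantissa is NaN (assumed for E4M3).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return (1 + max_mantissa / 2 ** man_bits) * 2.0 ** (max_exp_field - bias)


if __name__ == "__main__":
    setup = FineTuneSetup()
    print("samples seen:", setup.samples_seen())
    print("E4M3 max:", fp8_max(4, 3, ieee_like=False))
    print("E5M2 max:", fp8_max(5, 2, ieee_like=True))
```

Under these assumptions, E5M2 trades mantissa precision for a much wider dynamic range than E4M3, which is consistent with the quoted choice of E5M2 for gradients and E4M3 for weights and activations.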