Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs
Authors: Saleh Ashkboos, Mahdi Nikdan, Soroush Tabesh, Roberto Castro, Torsten Hoefler, Dan Alistarh
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applied to LLAMA-family models, HALO achieves near-full-precision-equivalent results during fine-tuning on various tasks, while delivering up to 1.41× end-to-end speedup for full fine-tuning on RTX 4090 GPUs. [...] We examine the accuracy and performance of HALO for fine-tuning LLAMA and Qwen-family models [13; 43], via both FFT and PEFT. We observe that HALO closely tracks the accuracy of full-precision variants across a wide series of tasks, improving upon the best known prior methods [38; 41] on the more challenging INT8 and FP6 formats. We provide performance measurements per module and end-to-end, with peak speedups of 1.82× and 1.41× for INT8, relative to a well-optimized half-precision baseline. |
| Researcher Affiliation | Academia | Saleh Ashkboos (ETH Zurich), Mahdi Nikdan (IST Austria), Soroush Tabesh (IST Austria), Roberto L. Castro (IST Austria), Torsten Hoefler (ETH Zurich), Dan Alistarh (IST Austria & Neural Magic) |
| Pseudocode | No | The paper describes the methods narratively and through mathematical equations, such as (1a), (1b), (1c) and (4) through (7), and Table 3 summarizes HALO levels, but there is no explicit pseudocode or algorithm block presented. |
| Open Source Code | Yes | For the paper, we provide an anonymous code. |
| Open Datasets | Yes | For both FFT and PEFT, we consider three standard datasets: 1) ViGGO [20], with 5.1k training and 1.08k test samples, 2) Grade-School Math (GSM8k) [8], with 7.74k training and 1.32k test samples, and 3) SQL generation [49; 46], with 30k training and 1k test samples. |
| Dataset Splits | Yes | For both FFT and PEFT, we consider three standard datasets: 1) ViGGO [20], with 5.1k training and 1.08k test samples, 2) Grade-School Math (GSM8k) [8], with 7.74k training and 1.32k test samples, and 3) SQL generation [49; 46], with 30k training and 1k test samples. |
| Hardware Specification | Yes | Speedups are measured on RTX 4090 GPUs with locked clocks, to reduce variance, for: a single linear layer and for end-to-end training. [...] Using four GPUs, we can fit only four samples into GPU memory, whereas with eight GPUs, we can fit up to eight samples, each with 512 tokens. [...] Table 2: End-to-end speedups one epoch LLAMA3-8B full fine-tuning with best performing HALO level using INT8 and FP8. NVIDIA 4x GPUs (RTX-4090) BS=4 BS=8; 8x GPUs (RTX-4090) BS=4 BS=8. |
| Software Dependencies | No | We implement HALO in PyTorch [31] based on the llm-foundry codebase [25] for FFT, and the standard Hugging Face PEFT library for HALOPEFT. We implement our own low-precision matrix multiplications using the CUTLASS library [29] for all linear modules (except for the LM head and embeddings) and keep the rest of the model in the original precision (BF16) during fine-tuning. For outlier mitigation, we adapt efficient Hadamard CUDA kernels [10]. The paper mentions software packages like PyTorch, CUTLASS, and CUDA kernels but does not provide their specific version numbers. |
| Experiment Setup | Yes | In all experiments, we tune the hyper-parameters on the base BF16 tasks, and re-use the same values for low-precision training. We always perform single-epoch experiments using the AdamW optimizer with β1 = 0.9, β2 = 0.999, and a linear learning rate warm-up of 20 steps. The batch size and sequence length are fixed at 32 and 512. For FFT, we choose learning rates 4 × 10⁻⁵, 6 × 10⁻⁶, and 3 × 10⁻⁵ for ViGGO, GSM8k, and SQL, respectively, and for PEFT LoRA experiments, we choose the learning rate 6 × 10⁻⁴ and LoRA rank of 16 for all datasets. |
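
The hyper-parameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The Python below is illustrative only: the names `FineTuneConfig`, `FFT_LR`, `PEFT_LR`, `LORA_RANK`, and `lr_at_step` are hypothetical and do not come from the HALO codebase. The source states only that warm-up is linear over 20 steps, so the `(step + 1) / warmup_steps` convention and the constant learning rate after warm-up are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    peak_lr: float
    beta1: float = 0.9        # AdamW β1
    beta2: float = 0.999      # AdamW β2
    warmup_steps: int = 20    # linear warm-up length
    batch_size: int = 32
    seq_len: int = 512
    epochs: int = 1           # single-epoch experiments


# Full fine-tuning (FFT) learning rates per dataset, as quoted in the table.
FFT_LR = {"ViGGO": 4e-5, "GSM8k": 6e-6, "SQL": 3e-5}

# PEFT (LoRA) settings shared by all datasets, as quoted in the table.
PEFT_LR = 6e-4
LORA_RANK = 16


def lr_at_step(cfg: FineTuneConfig, step: int) -> float:
    """Learning rate at a 0-indexed optimizer step.

    ASSUMPTION: the rate ramps as (step + 1) / warmup_steps and then stays
    at peak_lr; the source does not describe the post-warm-up schedule.
    """
    if step < cfg.warmup_steps:
        return cfg.peak_lr * (step + 1) / cfg.warmup_steps
    return cfg.peak_lr


def tokens_per_step(cfg: FineTuneConfig) -> int:
    """Tokens processed per optimizer step: batch size times sequence length."""
    return cfg.batch_size * cfg.seq_len
```

For example, `FineTuneConfig(peak_lr=FFT_LR["GSM8k"])` describes the GSM8k full fine-tuning run, and `tokens_per_step` on it gives 32 × 512 = 16384 tokens per step. Because the same values are re-used for low-precision training, one such configuration per dataset would cover both the BF16 baseline and the HALO runs.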