Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Irrational Complex Rotations Empower Low-bit Optimizers

Authors: Zhen Tian, Xin Zhao, Ji-Rong Wen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate π-Quant on a wide range of tasks. Our experiments show that it can reduce the bit-width of parameters to 3.32-bit, achieving a 41.8% decrease in GPU memory usage, all while maintaining full accuracy. The code is provided at https://github.com/RUCAIBox/Pi-Quant. Experimental results show that π-Quant is an effective quantizer, capable of reducing the bit-width of optimizer states to 3.32 bits, e.g., it can reduce the training memory of TinyLlama from 19.47 G to 11.32 G, with comparable accuracy. Besides, it consistently outperforms several state-of-the-art quantization methods on a wide range of tasks, highlighting the effectiveness of our approach.
Researcher Affiliation Collaboration Zhen Tian, ByteDance, Beijing, EMAIL; Wayne Xin Zhao, GSAI, Renmin University of China, Beijing, EMAIL; Ji-Rong Wen, GSAI, Renmin University of China, Beijing, EMAIL
Pseudocode Yes
Algorithm 1: Quantization of the Rotation Angles
  1: Input: Parameter T, size: n
  2: Split T into two tensors: X and Y
  3: Compute w = max(|X, Y|)
  4: Scale X and Y according to Eq. (8)
  5: Compute α and β using Eq. (3)
  6: Compute Ω according to Eq. (4)
  7: Compute m based on Eq. (7)
  8: Calculate Θ from Eq. (9)
  9: Output: Quantized parameter Θ, size: n/2; Scale Factor w, size: 1
Algorithm 2: Adam with π-Quant (Differences highlighted)
  1: Input: learn rate α, decay rates β_1, β_2, ϵ
  2: Initialize: m_0 = Quant(0), v_0 = Quant(0), t = 0 {using Alg. 1}
  3: for each iteration t = 1, 2, ..., T do
  4:   Compute gradient ∇θ_t
  5:   m_{t-1} ← Restore(m_{t-1}) {Eq. (10)}
  6:   m_t ← β_1 m_{t-1} + (1 − β_1) ∇θ_t
  7:   v_{t-1} ← Restore(v_{t-1}) {Eq. (10)}
  8:   v_t ← β_2 v_{t-1} + (1 − β_2) (∇θ_t)^2
  9:   m̂_t ← m_t / (1 − β_1^t), v̂_t ← v_t / (1 − β_2^t)
  10:  θ_t ← θ_{t-1} − α m̂_t / (√(v̂_t) + ϵ)
  11:  m_t ← Quant(m_t) {Alg. 1}
  12:  v_t ← Quant(v_t) {Alg. 1}
  13: end for
  14: Output: Optimized parameters θ_T
Open Source Code Yes The code is provided at https://github.com/RUCAIBox/Pi-Quant.
Open Datasets Yes We continually pre-train the TinyLlama-1.1B checkpoint [11] for 400 steps on the PG-19 [12] dataset, chunked into 64k segments, with a context window of 2048. We report the test perplexity in the Proof-pile dataset, and evaluate the trained LLM across four public benchmarks from Huggingface [13], i.e., ARC-Challenge, Hellaswag, Lambada and PIQA. ... We evaluate π-Quant on five tasks on long-range-arena (LRA) benchmark [14], including Listops [15], Text classification on IMDb review dataset [16], Document Retrieval on AAN dataset [17], Pathfinder [18], and Image classification on CIFAR-10 [19]. Besides, we introduce two Seq2Seq tasks, including text summarization on the Samsum [20] and sequential recommendation on the Movielens [21].
Dataset Splits Yes We continually pre-train the TinyLlama-1.1B checkpoint [11] for 400 steps on the PG-19 [12] dataset, chunked into 64k segments, with a context window of 2048. We report the test perplexity in the Proof-pile dataset, and evaluate the trained LLM across four public benchmarks from Huggingface [13], i.e., ARC-Challenge, Hellaswag, Lambada and PIQA. ... We evaluate π-Quant on five tasks on long-range-arena (LRA) benchmark [14] ... Follow the work [22], we use the 2-layer transformer as the backbone, and the training settings are provided in Appendix F.
Hardware Specification No Our experiments show that it can reduce the bit-width of parameters to 3.32-bit, achieving a 41.8% decrease in GPU memory usage... these methods require additional compilation of GPU-supported operators... and supports parallel computation on GPUs, making it practical in the large scale model optimization.
Software Dependencies No most deep learning frameworks (e.g., Pytorch [5], TensorFlow [6]) do not support such search quantization operations
Experiment Setup Yes The training settings are provided in Appendix F. ... Table 5: Hyper-parameter setting of the training pipeline.
  Dataset: PG19, LRA, Samsum, MovieLens
  Learning Rate: 2e-5, 1e-4, 3e-5, 1e-3
  Learning Rate Schedule: Linear, Linear, Linear
  Weight Decay: 0.0, 0.0, 0.0, 0.0
  Batch size: 64, 256, 4, 2048
  β1: 0.9, 0.9, 0.9, 0.9
  β2: 0.95, 0.95, 0.95, 0.95
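
The control flow of Algorithm 2 in the Pseudocode row (restore the stored moments, apply the usual Adam update, then re-quantize) can be sketched in plain Python. This is only an illustration under stated assumptions: Eqs. (3)-(10) are not reproduced on this page, so the `quant` and `restore` functions below are a hypothetical stand-in that uses simple absmax rounding to signed 4-bit codes. They are not π-Quant and say nothing about its accuracy. The function names, the list-based state layout, and the default learning rate are likewise invented for the example; β1 = 0.9 and β2 = 0.95 follow Table 5.

```python
import math

BITS = 4  # assumption: signed 4-bit codes for the stand-in quantizer


def quant(values, bits=BITS):
    """Stand-in absmax quantizer (NOT pi-Quant): returns (integer codes, scale)."""
    levels = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit codes
    scale = max(abs(v) for v in values) or 1.0   # avoid dividing by zero for all-zero input
    codes = [round(v / scale * levels) for v in values]
    return codes, scale


def restore(state, bits=BITS):
    """Map integer codes back to floats (plays the role of Restore in Alg. 2)."""
    codes, scale = state
    levels = 2 ** (bits - 1) - 1
    return [c * scale / levels for c in codes]


def adam_step(theta, grad, m_q, v_q, t, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8):
    """One Adam update whose moment estimates are stored in quantized form."""
    m_prev = restore(m_q)                                         # Alg. 2, step 5
    m = [beta1 * mp + (1 - beta1) * g for mp, g in zip(m_prev, grad)]
    v_prev = restore(v_q)                                         # Alg. 2, step 7
    v = [beta2 * vp + (1 - beta2) * g * g for vp, g in zip(v_prev, grad)]
    m_hat = [x / (1 - beta1 ** t) for x in m]                     # bias correction
    v_hat = [x / (1 - beta2 ** t) for x in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, quant(m), quant(v)                              # Alg. 2, steps 11-12


# Usage: a few steps on f(theta) = sum(theta_i ** 2), whose gradient is 2 * theta.
theta = [1.0, -2.0, 0.5]
m_q, v_q = quant([0.0] * len(theta)), quant([0.0] * len(theta))
for t in range(1, 4):
    grad = [2.0 * p for p in theta]
    theta, m_q, v_q = adam_step(theta, grad, m_q, v_q, t, lr=0.1)
```

Only the quantized codes and one scale per tensor persist between iterations, which is where the memory saving of a low-bit optimizer comes from. The Python lists here do not pack the codes into 4 bits, and the crude absmax scheme rounds small second-moment entries to zero, so the sketch shows the structure of the loop rather than a usable optimizer.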