Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning

Authors: Junyu Chen, Junzhuo Li, Zhen Peng, Wenjie Wang, Yuxiang Ren, Long Shi, Xuming Hu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We apply LoTA-QAF to Llama-3.1/3.3 and Qwen-2.5 model families and validate its effectiveness on several downstream tasks. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14%. For task-specific fine-tuning, 16-bit LoRA achieves superior results, but LoTA-QAF still outperforms other methods. Code is available in github.com/KingdalfGoodman/LoTA-QAF.
Researcher Affiliation	Academia	Junyu Chen1,2,3, Junzhuo Li3, Zhen Peng4, Wenjie Wang1,2, Yuxiang Ren5, Long Shi1,2, Xuming Hu3; 1 Southwestern University of Finance and Economics; 2 Artificial Intelligence and Digital Finance Key Laboratory of Sichuan Province; 3 The Hong Kong University of Science and Technology (Guangzhou); 4 Sun Yat-sen University; 5 Nanjing University. EMAIL EMAIL EMAIL
Pseudocode	No	The paper describes the components of the method (ternary adaptation, lossless merging mechanism, t-Sign SGD) and provides mathematical equations for them, but it does not present a structured pseudocode block or an algorithm figure.
Open Source Code	Yes	Code is available in github.com/KingdalfGoodman/LoTA-QAF.
Open Datasets	Yes	We conduct experiments on several large language models: Llama 3.1 8B, Qwen 2.5 14B, Qwen 2.5 32B, and Llama 3.3 70B. GPTQ (Frantar et al., 2022) asymmetric quantization is applied to all these models, with a group size of 64 for Llama 3.1 8B and Qwen 2.5 14B, and 128 for Qwen 2.5 32B and Llama 3.3 70B. For calibration, we use 1024 samples from the C4 dataset (Raffel et al., 2019). ... For performance-recovery fine-tuning, we utilize the Alpaca (Taori et al., 2023) and subsequently evaluate 5-shot performance on the Massively Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020). For task-specific fine-tuning, we select three datasets: GSM8K (Cobbe et al., 2021), with 7.47k training and 1.32k test samples; SQL generation (Yu et al., 2018; Zhong et al., 2017), with 30k training and 1k test samples; ViGGO (Juraska et al., 2019), with 5.1k training and 1.08k test samples.
Dataset Splits	Yes	For calibration, we use 1024 samples from the C4 dataset (Raffel et al., 2019). ... For task-specific fine-tuning, we select three datasets: GSM8K (Cobbe et al., 2021), with 7.47k training and 1.32k test samples; SQL generation (Yu et al., 2018; Zhong et al., 2017), with 30k training and 1k test samples; ViGGO (Juraska et al., 2019), with 5.1k training and 1.08k test samples.
Hardware Specification	Yes	All experiments are conducted on one NVIDIA A800 GPU.
Software Dependencies	No	The paper mentions "PyTorch framework" and "Triton" along with specific kernels like "TritonV2QuantLinear kernel" and "TorchQuantLinear kernel". However, it does not provide specific version numbers for PyTorch or Triton itself.
Experiment Setup	Yes	Following QA-LoRA, we use a paged AdamW optimizer, a maximum gradient norm of 0.3, a batch size of 64, a source length of 1024, and a target length of 256. For performance-recovery fine-tuning, we set the learning rate to 1 × 10⁻⁵ for the 8B and 14B models, and 5 × 10⁻⁶ for the 32B and 70B models. The number of fine-tuning steps is 300 for Alpaca. For task-specific fine-tuning, the learning rate is set to 5 × 10⁻⁴ for the 8B and 14B models, and 1 × 10⁻⁴ for the 32B and 70B models. Single-epoch experiments are performed on the training sets of GSM8K, SQL generation, and ViGGO. Regarding adapter settings, the rank r is 64 for the 8B and 14B models, and 32 for the 32B and 70B models. Additionally, the coefficient α is twice the rank. Regarding the hyper-parameters of LoTA-QAF, we set the threshold ω to 0.75r for Alpaca, GSM8K and SQL generation, and ω to 0.875r for ViGGO. The dynamic percentile-based threshold, σt, is initialized to the top 5% and linearly decays to 0.1% during the first 80% of the training phase. For the final 20% of training (i.e., from 80% to 100% completion), it is fixed at 0.01%.
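
The percentile schedule quoted in the Experiment Setup row can be read as a piecewise function of training progress. The following is a minimal sketch of one such reading, not the authors' implementation: the function name `sigma_percentile`, its parameters, and the choice to switch to the fixed value exactly at the 80% mark are assumptions made for illustration. Only the numbers (5%, 0.1%, 0.01%, 80%) and the 300-step Alpaca run come from the quoted text.

```python
def sigma_percentile(step: int, total_steps: int,
                     start: float = 5.0, end: float = 0.1,
                     final: float = 0.01,
                     decay_fraction: float = 0.8) -> float:
    """Return the top-percentile threshold (in percent) at a training step.

    Hypothetical helper: linear decay from `start` to `end` over the first
    `decay_fraction` of training, then a constant `final` value.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")

    progress = step / total_steps
    if progress < decay_fraction:
        # Assumed interpretation: straight-line interpolation in percent.
        return start + (end - start) * (progress / decay_fraction)
    # Assumed boundary: the fixed value applies from the 80% mark onward.
    return final


# Example with the 300 Alpaca fine-tuning steps quoted above.
for s in (0, 120, 270, 300):
    print(s, sigma_percentile(s, 300))
```

Under this reading the threshold starts at 5%, passes through roughly 2.55% halfway through the decay phase, and is held at 0.01% for the last fifth of training.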