Extreme Compression of Large Language Models via Additive Quantization

Authors: Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the AQLM algorithm in typical scenarios for post-training quantization of modern LLMs. Our evaluation is focused on the LLAMA 2 model family since it is a popular backbone for fine-tuned models or general LLM applications, e.g. (Dettmers et al., 2023a), and we also present results on Mistral-family models (Jiang et al., 2024). In Section 4.1, we evaluate the full AQ procedure for various LLAMA 2 models and quantization bit-widths; Section 4.3 presents an ablation analysis for individual AQ components and implementation details.
Researcher Affiliation | Collaboration | ¹HSE University, ²Yandex Research, ³Skoltech, ⁴IST Austria, ⁵Neural Magic.
Pseudocode | Yes | Algorithm 1 AQLM: Additive Quantization for LLMs. (An illustrative decoding sketch follows the table.)
Open Source Code | Yes | We share the code for our method in the GitHub repository https://github.com/Vahe1994/AQLM/tree/AQLM_camera_ready.
Open Datasets | Yes | We report perplexity on the WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2020) validation sets. We also measure zero-shot accuracy on WinoGrande (Sakaguchi et al., 2021), PiQA (Tata & Patel, 2003), HellaSwag (Zellers et al., 2019), ARC-easy and ARC-challenge (Clark et al., 2018) via the LM Eval Harness (Gao et al., 2021). We calibrate each algorithm using a subset of the RedPajama dataset (Computer, 2023), with a sequence length of 4096. (A perplexity-evaluation sketch follows the table.)
Dataset Splits | Yes | We report perplexity on the WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2020) validation sets. We calibrate each algorithm using a subset of the RedPajama dataset (Computer, 2023), with a sequence length of 4096. We fine-tune all models on 4-16M training tokens: 1-4k sequences of length 4k for LLAMA 2 models (Touvron et al., 2023) and 512 sequences of length 8k for Mixtral (Jiang et al., 2024). We fine-tune on the same data as during initial calibration (i.e. samples from RedPajama (Computer, 2023)). (A calibration-sampling sketch follows the table.)
Hardware Specification | Yes | In all of our experiments, we used either Nvidia A100 or H100 GPUs; the number of GPUs varied from 1 to 8. We used activation offloading to lower peak memory usage. To evaluate inference speed on GPU we used a consumer-grade Nvidia RTX 3090, and for the CPU setup we used an Intel Core i9-13900K.
Software Dependencies | No | The paper mentions software such as PyTorch, JAX, and the Adam optimizer, but it does not provide specific version numbers for any of these components, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | For each update phase, our implementation runs 100 Adam steps with a learning rate of 10^-4. However, we found that the final result is not sensitive to either of these parameters: training with a smaller number of steps or a lower learning rate achieves the same loss, but takes longer to converge. AQLM configurations for 2, 3, and 4 bits: for 2 bits, we used 1 codebook of size 2^15 or 2^16 with groups of 8; for 3 bits, we used 2 codebooks of size 2^12 with groups of 8; and for 4 bits, we used 2 codebooks of size 2^15 or 2^16 with groups of 8. For both fine-tuning (Section 3.4) and the codebook update (Section 3.3), we used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10^-4, β1 = 0.90 and β2 = 0.95. (An optimizer-configuration sketch follows the table.)
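Illustrative sketches for the rows above. First, the Pseudocode row points to Algorithm 1 (AQLM). Below is a minimal sketch of the additive-quantization representation that the algorithm operates on, assuming each group of 8 weights is stored as one code per codebook plus a scale; the function name, tensor shapes, and scalar scale are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): reconstruct one group of 8 weights
# as the sum of one vector from each learned codebook, scaled per output channel.
import torch

def dequantize_group(codes, codebooks, scale):
    """codes:     LongTensor  [num_codebooks]                      -- one index per codebook
    codebooks: FloatTensor [num_codebooks, codebook_size, group_size]
    scale:     scalar tensor (stand-in for a per-channel scale)"""
    # Pick the selected vector from every codebook, then add them up.
    selected = codebooks[torch.arange(codebooks.shape[0]), codes]  # [num_codebooks, group_size]
    return scale * selected.sum(dim=0)                             # [group_size]

# Example matching the quoted 2-bit setting: 1 codebook of size 2**16 over
# groups of 8 weights (16 bits per 8 weights = 2 bits per weight, plus scales).
codebooks = torch.randn(1, 2**16, 8)
codes = torch.tensor([12345])
print(dequantize_group(codes, codebooks, torch.tensor(1.0)).shape)  # torch.Size([8])
```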
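For the WikiText-2 perplexity quoted in the Open Datasets row, a common protocol is to concatenate the evaluation text, split it into fixed-length chunks, and exponentiate the mean token-level loss. The sketch below follows that generic recipe with Hugging Face transformers/datasets; the checkpoint name, chunk length, and split handling are assumptions and may differ from the authors' evaluation script.

```python
# Generic perplexity-evaluation sketch on the WikiText-2 validation split.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

seq_len, nlls = 4096, []
for start in range(0, ids.shape[1] - seq_len, seq_len):
    chunk = ids[:, start:start + seq_len].to(model.device)
    with torch.no_grad():
        # With labels=chunk the model returns the mean cross-entropy over the chunk.
        nlls.append(model(chunk, labels=chunk).loss.float())

print("WikiText-2 perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```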
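The Dataset Splits row describes calibration on RedPajama sequences of length 4096. The hypothetical sketch below shows one way such a calibration set could be assembled; the Hugging Face dataset id, the number of sequences, and the rejection of short documents are assumptions rather than the authors' procedure.

```python
# Hypothetical calibration sampling: random 4096-token windows from RedPajama.
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed tokenizer
data = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")  # assumed dataset id

def sample_calibration(num_sequences=1024, seq_len=4096, seed=0):
    rng = random.Random(seed)
    samples = []
    while len(samples) < num_sequences:
        tokens = tokenizer(data[rng.randrange(len(data))]["text"], return_tensors="pt").input_ids
        if tokens.shape[1] >= seq_len:  # keep only documents long enough for a full window
            start = rng.randrange(tokens.shape[1] - seq_len + 1)
            samples.append(tokens[:, start:start + seq_len])
    return torch.cat(samples, dim=0)  # [num_sequences, seq_len]

calibration_data = sample_calibration()
print(calibration_data.shape)  # torch.Size([1024, 4096])
```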
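Finally, the Experiment Setup row fixes the optimizer settings (Adam, learning rate 10^-4, β1 = 0.90, β2 = 0.95, 100 steps per update phase). The minimal sketch below wires those numbers into a per-layer update loop that matches a quantized layer's output to the original layer's output on calibration activations; quantized_layer, calib_inputs, and reference_output are placeholders, not the authors' training code.

```python
# Optimizer settings from the quote: Adam, lr=1e-4, betas=(0.90, 0.95),
# 100 steps per update phase. The layer and data objects are placeholders.
import torch

def run_update_phase(quantized_layer, calib_inputs, reference_output, num_steps=100):
    # Continuous parameters (codebooks, scales) are trained; discrete codes stay fixed.
    optimizer = torch.optim.Adam(quantized_layer.parameters(), lr=1e-4, betas=(0.9, 0.95))
    for _ in range(num_steps):
        optimizer.zero_grad()
        # Match the quantized layer's output to the original layer's output
        # on the calibration activations (squared-error objective).
        loss = torch.nn.functional.mse_loss(quantized_layer(calib_inputs), reference_output)
        loss.backward()
        optimizer.step()
    return quantized_layer
```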