Extreme Compression of Large Language Models via Additive Quantization
Authors: Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the AQLM algorithm in typical scenarios for post-training quantization of modern LLMs. Our evaluation is focused on the LLAMA 2 model family since it is a popular backbone for fine-tuned models or general LLM applications, e.g. (Dettmers et al., 2023a), and we also present results on Mistral-family models (Jiang et al., 2024). In Section 4.1, we evaluate the full AQ procedure for various LLAMA 2 models and quantization bit-widths; Section 4.3 presents an ablation analysis for individual AQ components and implementation details. |
| Researcher Affiliation | Collaboration | 1HSE University 2Yandex Research 3Skoltech 4IST Austria 5Neural Magic. |
| Pseudocode | Yes | Algorithm 1 AQLM: Additive Quantization for LLMs |
| Open Source Code | Yes | We share the code for our method in the GitHub repository https://github.com/Vahe1994/AQLM/tree/AQLM_camera_ready. |
| Open Datasets | Yes | We report perplexity on WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2020) validation sets. We also measure zero-shot accuracy on WinoGrande (Sakaguchi et al., 2021), PiQA (Tata & Patel, 2003), HellaSwag (Zellers et al., 2019), ARC-easy and ARC-challenge (Clark et al., 2018) via the LM Eval Harness (Gao et al., 2021). We calibrate each algorithm using a subset of the RedPajama dataset (Computer, 2023), with a sequence length of 4096. |
| Dataset Splits | Yes | We report perplexity on WikiText-2 (Merity et al., 2016) & C4 (Raffel et al., 2020) validation sets. We calibrate each algorithm using a subset of the RedPajama dataset (Computer, 2023), with a sequence length of 4096. We fine-tune all models on 4-16M training tokens: 1-4k sequences of length 4k for LLAMA 2 models (Touvron et al., 2023) and 512 sequences of length 8k for Mixtral (Jiang et al., 2024). We fine-tune on the same data as during initial calibration (i.e. samples from RedPajama (Computer, 2023)). |
| Hardware Specification | Yes | In all of our experiments, we used either Nvidia A100 or H100 GPUs; the number of GPUs varied from 1 to 8. We used activation offloading to lower peak memory usage. To evaluate inference speed on GPU we used a consumer-grade Nvidia RTX 3090, and for the CPU setup we used an Intel Core i9-13900K. |
| Software Dependencies | No | The paper mentions software such as PyTorch, JAX, and the Adam optimizer, but it does not provide specific version numbers for any of these components, which would be required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | For each update phase, our implementation runs 100 Adam steps with learning rate 10^-4. However, we found that the final result is not sensitive to either of these parameters: training with a smaller number of steps or a smaller learning rate achieves the same loss, but takes longer to converge. AQLM configurations for 2, 3, and 4 bits: for 2 bits we used 1 codebook of size 2^15 or 2^16 with groups of 8; for 3 bits we used 2 codebooks of size 2^12 with groups of 8; for 4 bits we used 2 codebooks of size 2^15 or 2^16 with groups of 8. Both for fine-tuning (Section 3.4) and for the codebook update (Section 3.3) we used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10^-4, β1 = 0.90 and β2 = 0.95. |
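
The "Pseudocode" and "Experiment Setup" rows describe the additive-quantization representation used by AQLM: each group of 8 weights is encoded by one or more codebook indices, so a single codebook of 2^16 entries costs 16/8 = 2 bits per weight and two codebooks of 2^12 entries cost 24/8 = 3 bits per weight. The PyTorch sketch below shows one way such an encoding can be decoded back into a dense weight matrix; the tensor names, shapes, and per-channel scales are illustrative assumptions, not the exact layout used in the authors' repository.

```python
import torch

def dequantize_aqlm(codes: torch.Tensor,
                    codebooks: torch.Tensor,
                    scales: torch.Tensor,
                    out_shape: tuple) -> torch.Tensor:
    """Reconstruct a dense weight matrix from an additive-quantization encoding.

    Assumed (illustrative) layout:
      codes:     (num_groups, num_codebooks) integer indices
      codebooks: (num_codebooks, codebook_size, group_size) learned vectors
      scales:    (out_features, 1) per-output-channel scales
    """
    num_groups, num_codebooks = codes.shape
    group_size = codebooks.shape[-1]

    # Each group of `group_size` weights is the SUM of one vector per codebook.
    groups = torch.zeros(num_groups, group_size, dtype=codebooks.dtype)
    for c in range(num_codebooks):
        groups += codebooks[c, codes[:, c]]          # gather + accumulate

    # Reassemble groups into the dense matrix and apply per-channel scales.
    weight = groups.reshape(out_shape)
    return weight * scales


# Toy example mirroring the 2-bit configuration quoted in the table above:
# 1 codebook of 2^16 entries, groups of 8 -> 16 bits per 8 weights = 2 bits/weight.
out_features, in_features, group_size = 32, 64, 8
num_groups = out_features * in_features // group_size
codes = torch.randint(0, 2 ** 16, (num_groups, 1))
codebooks = torch.randn(1, 2 ** 16, group_size)
scales = torch.randn(out_features, 1)
w = dequantize_aqlm(codes, codebooks, scales, (out_features, in_features))
print(w.shape)  # torch.Size([32, 64])
```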
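
The same row quotes the optimizer settings for the codebook-update and fine-tuning phases (roughly 100 Adam steps, learning rate 10^-4, β1 = 0.90, β2 = 0.95). A minimal sketch of one such phase is shown below, assuming the same illustrative tensor layout as above: the discrete codes are held fixed while Adam updates the continuous codebooks and scales to reduce the layer-output reconstruction error on calibration activations. The objective and variable names are assumptions for illustration; in the full AQLM procedure the codes themselves are also re-optimized between such gradient phases.

```python
import torch

def update_codebooks(codes, codebooks, scales, weight_ref, calib_x,
                     num_steps: int = 100, lr: float = 1e-4):
    """One codebook-update phase: Adam on the continuous parameters, codes fixed.

    Assumed shapes (same illustrative layout as the decoding sketch above):
      codes:      (num_groups, num_codebooks) frozen integer indices
      codebooks:  (num_codebooks, codebook_size, group_size) trainable
      scales:     (out_features, 1) trainable per-channel scales
      weight_ref: (out_features, in_features) original weights
      calib_x:    (n_samples, in_features) calibration activations
    """
    codebooks = codebooks.clone().requires_grad_(True)
    scales = scales.clone().requires_grad_(True)
    # Hyperparameters quoted in the table: lr 1e-4, beta1 = 0.90, beta2 = 0.95.
    opt = torch.optim.Adam([codebooks, scales], lr=lr, betas=(0.9, 0.95))

    target = calib_x @ weight_ref.T                        # reference layer output
    for _ in range(num_steps):
        opt.zero_grad()
        # Decode the quantized weights: sum of codebook vectors per group.
        groups = sum(codebooks[c, codes[:, c]] for c in range(codes.shape[1]))
        w_hat = groups.reshape(weight_ref.shape) * scales
        loss = ((calib_x @ w_hat.T - target) ** 2).mean()  # output-reconstruction MSE
        loss.backward()
        opt.step()
    return codebooks.detach(), scales.detach()
```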
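
The "Open Datasets" and "Dataset Splits" rows report perplexity on WikiText-2 and C4 with a 4096-token context. A common way to reproduce this style of measurement with the Hugging Face stack is sketched below; the model identifier, dataset configuration name, and chunking strategy are assumptions and may differ from the authors' evaluation script.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"          # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")
model.eval()

# WikiText-2 validation split (as quoted in the rows above), concatenated
# and chunked into non-overlapping 4096-token segments.
data = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")
ids = tokenizer("\n\n".join(data["text"]), return_tensors="pt").input_ids
seq_len, nlls, n_tokens = 4096, [], 0

with torch.no_grad():
    for start in range(0, ids.shape[1] - seq_len, seq_len):
        chunk = ids[:, start:start + seq_len].to(model.device)
        # labels=chunk makes the model return the mean causal-LM cross-entropy.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seq_len)      # approximate summed NLL per chunk
        n_tokens += seq_len

print("perplexity:", torch.exp(torch.stack(nlls).sum() / n_tokens).item())
```

For the zero-shot tasks (WinoGrande, PiQA, HellaSwag, ARC-easy, ARC-challenge) the paper defers to the LM Eval Harness (Gao et al., 2021) rather than a custom evaluation script.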