PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression

Authors: Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtárik

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 experiments. Figure 1: WikiText-2 perplexity (left) and average zero-shot accuracy (right) of 2-bit quantized LLAMA 2 models as a function of model size (GiB). See detailed setup in Section 4.3. Table 1: Comparing different fine-tuning strategies for VQ, GPTQ and AQLM on LLAMA 2 7B in terms of perplexity on WikiText-2, C4 and average zero-shot accuracy on tasks from Section 4.3.
Researcher Affiliation | Collaboration | Vladimir Malinovskii (Yandex, HSE University); Denis Mazur (MIPT, SberDevices); Ivan Ilin (AI Initiative, KAUST); Denis Kuznedelev (Yandex, Skoltech); Konstantin Burlachenko (AI Initiative, KAUST); Kai Yi (AI Initiative, KAUST); Dan Alistarh (IST Austria, Neural Magic); Peter Richtárik (AI Initiative, KAUST)
Pseudocode | Yes | Algorithm 1: PV algorithm
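The paper's Algorithm 1 is not reproduced in this report. Purely as an illustration of the general shape of such an alternating scheme, the toy sketch below alternates a continuous value update ("V step": refit codebook values with assignments fixed) with a restricted discrete update ("P step": reassign only a small fraction of codes per iteration) on a synthetic scalar-quantization problem. The variable names, the restricted-reassignment heuristic, and the toy objective are all illustrative assumptions, not the paper's exact update rules.

```python
# Illustrative toy only; NOT the paper's Algorithm 1 (see the official repo for that).
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4096)                       # toy "weights" to quantize
codebook = np.linspace(w.min(), w.max(), 4)     # 2-bit codebook (4 values)
codes = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
tau = 0.05                                      # fraction of codes allowed to change per P step

for step in range(20):
    # V step: with codes fixed, set each codebook entry to the mean of its assigned weights.
    for j in range(len(codebook)):
        mask = codes == j
        if mask.any():
            codebook[j] = w[mask].mean()
    # P step: find the best code per weight, but commit only the tau fraction
    # of coordinates whose squared error improves the most (restricted update).
    best = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
    gain = (w - codebook[codes]) ** 2 - (w - codebook[best]) ** 2
    top = np.argsort(-gain)[: int(tau * len(w))]
    codes[top] = best[top]
    mse = np.mean((w - codebook[codes]) ** 2)

print(f"final toy quantization MSE: {mse:.5f}")
```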
Open Source Code | Yes | The official implementation is available at https://github.com/Vahe1994/AQLM/tree/pv-tuning.
Open Datasets | Yes | calibrating on the RedPajama [13] dataset that best approximates the original pre-training data. For full model evaluation, we report quantized model perplexity on the WikiText-2 [45] dataset.
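For readers verifying data availability, a minimal sketch of loading the cited public datasets with the Hugging Face `datasets` library follows. The dataset identifiers and configs are assumptions based on common usage in post-training-quantization work, not taken from the paper's code.

```python
# Minimal sketch (assumed dataset identifiers; not the paper's loading code).
from datasets import load_dataset

# Calibration data: a RedPajama sample (may require trust_remote_code=True
# depending on your datasets version).
calib = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")

# Perplexity evaluation data, following the splits commonly used in PTQ papers.
wikitext_test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
c4_val = load_dataset("allenai/c4", "en", split="validation", streaming=True)
```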
Dataset Splits | Yes | We report perplexity on WikiText-2 [45] and C4 [54] validation sets, zero-shot accuracy on WinoGrande [60], PiQA [67], HellaSwag [83], ARC-easy and ARC-challenge [12] via the LM Eval Harness [24]. We follow the exact evaluation setup from GPTQ [22]. We use the same data splits and preprocessing as in most recent PTQ works [22, 42, 18, 70, 21, 71].
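For context, a minimal sketch of the zero-shot portion of this evaluation via the LM Eval Harness (lm-eval >= 0.4 API assumed; the model identifier is a placeholder, not one of the paper's quantized checkpoints):

```python
# Sketch of the zero-shot evaluation through the LM Eval Harness.
# Placeholder model id; lm-eval >= 0.4 metric keys ("acc,none") assumed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=float16",
    tasks=["winogrande", "piqa", "hellaswag", "arc_easy", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```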
Hardware Specification | Yes | Our code can train 7B LLMs on a single GPU, while larger ones (e.g. 70B) fit into a single machine with 8 A100 GPUs. We evaluate inference speeds on a single NVIDIA RTX 3090 GPU using transformers with CUDA graphs.
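A rough sketch of how single-GPU generation throughput could be timed is given below. This harness is an assumption of this report, not the paper's benchmark script, and it times plain eager generation rather than the CUDA-graph path the paper's speed numbers rely on.

```python
# Rough single-GPU throughput timing (assumed harness; eager generation,
# no CUDA graphs, placeholder model id).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")
new_tokens = 128
model.generate(**inputs, max_new_tokens=8)      # warm-up
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=new_tokens)
torch.cuda.synchronize()
print(f"{new_tokens / (time.perf_counter() - start):.1f} tokens/s")
```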
Software Dependencies | No | The paper mentions software such as PyTorch and transformers with CUDA graphs but does not provide specific version numbers for these components. For example, "We use PyTorch Fully Sharded Data Parallel [52, 40]" mentions PyTorch but not its version.
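Since PyTorch FSDP is named but unversioned, here is a minimal wrapping sketch using the PyTorch 2.x FSDP API; launch details (torchrun, process group setup) and the model id are assumptions, not taken from the paper.

```python
# Minimal FSDP wrapping sketch (PyTorch 2.x API assumed; run under torchrun).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)
model = FSDP(model.cuda())  # shards parameters, gradients, and optimizer state
```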
Experiment Setup | Yes | We use a batch size of 2^20 (~1M) tokens, split into batches of model-specific sequence length (e.g. 4096 tokens for LLAMA 2, 8192 for LLAMA 3). For all algorithms, we tune the learning rate over a logarithmic grid of (1e-4, 3e-4, 1e-3, 3e-3, 1e-2). We use β1=0.9 and β2=0.95, the same as in most LLM training configurations [69, 86, 61].
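These reported settings map directly onto a standard PyTorch Adam configuration; the sketch below mirrors them, with the micro-batch size and gradient-accumulation arithmetic being illustrative assumptions rather than values from the paper.

```python
# Sketch of the reported optimizer settings (Adam betas and LR grid from the
# paper; micro-batch size and accumulation arithmetic are assumed).
import torch

seq_len = 4096                      # LLAMA 2; 8192 for LLAMA 3
tokens_per_step = 2 ** 20           # ~1M tokens per optimizer step
micro_batch = 8                     # sequences per forward pass (assumed)
accum_steps = tokens_per_step // (micro_batch * seq_len)   # = 32 here

lr_grid = (1e-4, 3e-4, 1e-3, 3e-3, 1e-2)    # logarithmic learning-rate sweep
model = torch.nn.Linear(16, 16)             # stand-in for the quantized LLM
for lr in lr_grid:
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.95))
    # ... fine-tune with this lr and keep the best-performing configuration ...
```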