Compressing Large Language Models using Low Rank and Low Precision Decomposition

Authors: Rajarshi Saha, Naomi Sagan, Varun Srivastava, Andrea Goldsmith, Mert Pilanci

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results illustrate that compressing LLaMa-2 7B/13B/70B and LLaMa-3 8B models using CALDERA outperforms existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter.
Researcher Affiliation | Academia | Rajarshi Saha (Stanford University), Naomi Sagan (Stanford University), Varun Srivastava (Stanford University), Andrea J. Goldsmith (Princeton University), Mert Pilanci (Stanford University)
Pseudocode | Yes | Algorithm 1: CALDERA: Calibration Aware Low-Precision DEcomposition with Low-Rank Adaptation; Algorithm 2: LPLRFACTORIZE(A, k, X, QL, QR, Tin): LPLR factorization submodule. (A simplified sketch of this alternating structure is given below the table.)
Open Source Code | Yes | The implementation is available at: https://github.com/pilancilab/caldera.
Open Datasets | Yes | The performance of CALDERA is evaluated using perplexity on the test splits of the Wikitext2 [25] and C4 [6] datasets, as well as task-specific goodness-of-fit metrics such as zero-shot accuracy for sequence classification. Specifically, zero-shot accuracy was measured on the Winogrande [19], RTE [1, 40], PiQA [2], ARC-Easy, and ARC-Challenge [4] tasks. (A standard Wikitext2 perplexity recipe is sketched below the table.)
Dataset Splits | Yes | The calibration dataset contains 256 samples in total, with 192 data points in the training split and 64 in the evaluation split.
Hardware Specification | Yes | Experiments were performed on NVIDIA RTX A6000, NVIDIA A10G, or NVIDIA H100 GPUs.
Software Dependencies | No | The paper mentions "PyTorch" and "Hugging Face implementations" but does not specify their version numbers, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | For all CALDERA decompositions, the number of alternating iterations between updating Q and L, R (i.e., Tout in Alg. 1) is 15. For decompositions with quantized low-rank factors, except LLaMa-2 7B and LLaMa-3 8B, the number of LPLR iterations (i.e., Tin in Alg. 2) is 10. For LLaMa-2 7B and LLaMa-3 8B, the number of LPLR iterations is 50. ... RHT fine-tuning was performed for 5 epochs with a learning rate of 10^-3. ... Table 8: Hyperparameter settings for low-rank adaptation*. Batch size refers to the per-device batch size. All fine-tuning experiments are parallelized across four GPUs. (These settings are collected into a config sketch below the table.)
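The Pseudocode row summarizes the two algorithms. As a rough illustration of the alternating structure only, the NumPy sketch below decomposes a weight matrix as W ≈ Q + LR by alternating a plain round-to-nearest quantizer with an unweighted SVD. The calibration-aware weighting, randomized Hadamard transforms, and lattice quantizers of the actual CALDERA and LPLRFACTORIZE routines are omitted, and all function names here are illustrative rather than taken from the released code.

```python
# Simplified, calibration-agnostic sketch of the alternating structure in Algorithm 1:
# W ~ Q + L @ R, with Q low-precision and L, R low-rank. Not the paper's exact procedure.
import numpy as np

def uniform_quantize(a, bits=2):
    """Toy symmetric uniform quantizer (a stand-in for the paper's quantizers)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(a)) / qmax + 1e-12
    return np.clip(np.round(a / scale), -qmax, qmax) * scale

def rank_k_factors(a, k):
    """Best rank-k factors of `a` via SVD (unweighted, i.e., no calibration matrix)."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return u[:, :k] * s[:k], vt[:k, :]

def caldera_like_decompose(w, k=64, bits=2, t_out=15):
    """Alternate between the quantized backbone Q and the low-rank factors L, R."""
    L = np.zeros((w.shape[0], k))
    R = np.zeros((k, w.shape[1]))
    for _ in range(t_out):
        Q = uniform_quantize(w - L @ R, bits=bits)   # update backbone with L, R fixed
        L, R = rank_k_factors(w - Q, k)              # update low-rank part with Q fixed
    return Q, L, R

w = np.random.randn(512, 512)
Q, L, R = caldera_like_decompose(w)
print("relative error:", np.linalg.norm(w - Q - L @ R) / np.linalg.norm(w))
```

Each step holds one component fixed and updates the other (nearest-point rounding for Q, optimal rank-k factors for L, R under the unweighted Frobenius norm), which mirrors the role of the Tout outer iterations in Algorithm 1.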
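For the Wikitext2 perplexity numbers, a standard Hugging Face evaluation loop along the following lines is commonly used. This is not the authors' evaluation script: the model identifier is a placeholder for a CALDERA-compressed checkpoint, and the context length of 2048 is an assumption.

```python
# Hedged sketch: perplexity on the Wikitext2 test split with Hugging Face tooling.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; swap in the compressed model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).cuda().eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

seq_len, stride = 2048, 2048  # assumed context length; the paper's setting may differ
nlls = []
# Non-overlapping windows; the ragged tail is dropped for brevity.
for begin in range(0, encodings.input_ids.size(1) - seq_len, stride):
    input_ids = encodings.input_ids[:, begin:begin + seq_len].cuda()
    with torch.no_grad():
        # With labels == input_ids, the model shifts internally and returns the mean NLL.
        out = model(input_ids, labels=input_ids)
    nlls.append(out.loss)

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```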
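The settings reported in the Experiment Setup and Dataset Splits rows can be collected into a single configuration sketch. The field names below are illustrative and need not match the argument names used in the released repository.

```python
# Hedged sketch: reported CALDERA hyperparameters gathered in one place.
caldera_settings = {
    "T_out": 15,                      # alternating iterations over Q and (L, R), Alg. 1
    "T_in_default": 10,               # LPLR iterations (Alg. 2) for most models
    "T_in_llama2_7b_llama3_8b": 50,   # LPLR iterations for LLaMa-2 7B and LLaMa-3 8B
    "rht_finetune_epochs": 5,         # randomized Hadamard transform fine-tuning
    "rht_finetune_lr": 1e-3,
    "calibration_samples": 256,       # split as 192 train / 64 eval
    "finetune_gpus": 4,               # fine-tuning parallelized across four GPUs
}
```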