Compressing Large Language Models using Low Rank and Low Precision Decomposition
Authors: Rajarshi Saha, Naomi Sagan, Varun Srivastava, Andrea Goldsmith, Mert Pilanci
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results illustrate that compressing LlaMa-2 7B/13B/70B and LlaMa-3 8B models using CALDERA outperforms existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter. |
| Researcher Affiliation | Academia | Rajarshi Saha Stanford University Naomi Sagan Stanford University Varun Srivastava Stanford University Andrea J. Goldsmith Princeton University Mert Pilanci Stanford University |
| Pseudocode | Yes | Algorithm 1: CALDERA: Calibration Aware Low-Precision DEcomposition with Low-Rank Adaptation; Algorithm 2: LPLRFACTORIZE(A, k, X, QL, QR, Tin): LPLR factorization submodule |
| Open Source Code | Yes | The implementation is available at: https://github.com/pilancilab/caldera. |
| Open Datasets | Yes | The performance of CALDERA is evaluated using perplexity on the test splits of the Wikitext2 [25] and C4 [6] datasets, as well as task-specific goodness-of-fit metrics such as zero-shot accuracy for sequence classification. Specifically, zero-shot accuracy was measured on the Winogrande [19], RTE [1, 40], PiQA [2], ARC-Easy, and ARC-Challenge [4] tasks. |
| Dataset Splits | Yes | The calibration dataset is 256 samples in total, with 192 data points in the training split and 64 in the evaluation split. |
| Hardware Specification | Yes | Experiments were performed on either NVIDIA RTX A6000, NVIDIA A10G, or NVIDIA H100 GPUs. |
| Software Dependencies | No | The paper mentions "PyTorch" and "Hugging Face implementations" but does not specify their version numbers, which are required for a reproducible description of the ancillary software. |
| Experiment Setup | Yes | For all CALDERA decompositions, the number of alternating iterations between updating Q and L, R (i.e., Tout in Alg. 1) is 15. For decompositions with quantized low-rank factors, except LLaMa-2 7B and LLaMa-3 8B, the number of LPLR iterations (i.e., Tin in Alg. 2) is 10. For LLaMa-2 7B and LLaMa-3 8B, the number of LPLR iterations is 50. ... RHT fine-tuning was performed for 5 epochs with a learning rate of 10^-3. ... Table 8: Hyperparameter settings for low-rank adaptation*. Batch size refers to the per-device batch size. All fine-tuning experiments are parallelized across four GPUs. (A simplified sketch of the outer alternating loop appears below the table.) |
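
The Experiment Setup row refers to Algorithm 1's outer loop, which alternates between updating the quantized backbone Q and the low-rank factors L, R for Tout iterations, with Algorithm 2 (LPLRFACTORIZE) handling quantized low-rank factors over Tin inner iterations. Below is a minimal PyTorch sketch of that alternating structure only: it substitutes an unweighted SVD and a toy round-to-nearest quantizer for the paper's calibration-aware (Hessian-weighted) updates and lattice quantizers, omits the inner LPLR loop, and uses hypothetical function names (`caldera_like_decompose`, `nearest_quantize`) that are not taken from the released repository.

```python
import torch

def nearest_quantize(x: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Toy uniform round-to-nearest quantizer; a stand-in for the
    quantizers used in the paper (hypothetical helper, not from the repo)."""
    half_levels = 2 ** (bits - 1) - 0.5          # e.g. 1.5 for 2 bits
    scale = x.abs().max() / half_levels + 1e-12  # guard against division by zero
    q = torch.round(x / scale - 0.5) + 0.5       # mid-rise grid of 2^bits points
    return torch.clamp(q, -half_levels, half_levels) * scale

def caldera_like_decompose(W: torch.Tensor, rank: int, T_out: int = 15, bits: int = 2):
    """Alternate between updating the quantized backbone Q and the low-rank
    factors L, R for T_out iterations (T_out = 15 in the reported setup).
    Calibration weighting and quantization of L, R are omitted for brevity."""
    LR = torch.zeros_like(W)  # running low-rank approximation L @ R
    for _ in range(T_out):
        # Update Q: quantize whatever the low-rank part cannot explain.
        Q = nearest_quantize(W - LR, bits=bits)
        # Update L, R: best rank-k fit to the remaining residual.
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        L, R = U[:, :rank] * S[:rank], Vh[:rank, :]
        LR = L @ R
    return Q, L, R

# Example: decompose a random weight matrix into Q + L @ R and check the error.
W = torch.randn(1024, 1024)
Q, L, R = caldera_like_decompose(W, rank=64)
print(torch.linalg.norm(W - (Q + L @ R)) / torch.linalg.norm(W))
```

In this sketch each outer iteration first re-quantizes the residual W - LR and then refits a rank-k factorization of W - Q via SVD; the paper's actual updates weight these steps by the calibration data, which is why the 256-sample calibration split listed above matters for reproduction.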