Compressing LLMs: The Truth is Rarely Pure and Never Simple

Authors: Ajay Kumar Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, which have significant alignment with their dense counterparts, and perplexity fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity in knowledge-intensive tasks; current quantization methods are more successful than pruning; yet, pruned LLMs even at 50% sparsity are robust in-context retrieval and summarization systems; among others.
Researcher Affiliation | Collaboration | Ajay Jaiswal¹, Zhe Gan², Xianzhi Du², Bowen Zhang², Zhangyang Wang¹, Yinfei Yang² (¹University of Texas at Austin, ²Apple)
Pseudocode | No | The paper does not include any figures, blocks, or sections labeled "Pseudocode" or "Algorithm", nor does it present structured steps formatted like code or an algorithm.
Open Source Code | Yes | The reproduced code is available at https://github.com/VITA-Group/llm-kick.
Open Datasets | Yes | "We use Freebase QA (Jiang et al., 2019), which is a dataset for open-domain QA over the Freebase knowledge graph." and "We use the popular MMLU (Massive Multitask Language Understanding) benchmark, which covers 50+ subjects across STEM, Humanities, Social Sciences, and more (Hendrycks et al., 2020)." and Table 1: "Freebase QA (Jiang et al., 2019) https://huggingface.co/datasets/freebase_qa" and "MMLU Benchmark (Hendrycks et al., 2020) https://huggingface.co/datasets/freebase_qa".
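As a reproducibility aid, the snippet below is a minimal sketch of fetching the two quoted benchmarks from the Hugging Face Hub with the datasets library. The "cais/mmlu" identifier and its "all" config are assumptions (the paper's Table 1 entry for MMLU appears to duplicate the FreebaseQA URL), and the exact dataset versions the authors used are not stated.

    # Minimal sketch (not from the paper): pulling the benchmarks quoted above with the
    # Hugging Face `datasets` library. Dataset IDs other than the "freebase_qa" link
    # cited in Table 1 are assumptions and may need adjusting.
    from datasets import load_dataset

    # FreebaseQA (Jiang et al., 2019): open-domain QA over the Freebase knowledge graph.
    # This is an older script-based dataset; it may require an older `datasets` release.
    freebase_qa = load_dataset("freebase_qa")
    print(freebase_qa)

    # MMLU (Hendrycks et al., 2020): 50+ subjects across STEM, humanities, social sciences.
    # "cais/mmlu" with the "all" config is one common hosting of the benchmark (assumption).
    mmlu = load_dataset("cais/mmlu", "all", split="test")
    print(mmlu[0])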
Dataset Splits | No | The paper references various datasets (e.g., Freebase QA, MMLU, Trivia QA, CNN/Daily Mail, MT-Bench) but does not explicitly state the training, validation, or test splits (e.g., percentages, sample counts, or explicit use of standard splits) needed to reproduce the experiments. For CNN/Daily Mail, it mentions creating subsets based on story length, but this is not a train/validation/test split.
Hardware Specification | No | The paper mentions hardware requirements for LLMs in general (e.g., A100 GPUs for GPT-175B) but does not provide specific details about the hardware used to conduct its own experiments (e.g., exact GPU models, CPU types, or memory specifications).
Software Dependencies | No | The paper mentions various models and methods (e.g., Vicuna, GPTQ, SparseGPT, Wanda) and tools (FastChat, LLM-Judge), and cites general frameworks like PyTorch. However, it does not provide specific version numbers for key software dependencies such as Python, PyTorch, or other libraries used in the experimental setup.
Experiment Setup | Yes | "More specifically, we consider a broad range of tasks to evaluate subtle changes in pruned and quantized LLMs' ability for language understanding, reasoning, generation, in-context retrieval, long-context summarization, etc." and "We consider two types of sparsities: (i) Unstructured Sparsity: individual model weights are zeroed out independently, leading to irregular zero patterns (LeCun et al., 1990; Han et al., 2016); and (ii) Structured N:M Sparsity: a fine-grained sparsity pattern in which only N weights are non-zero for every continuous M weights (Nvidia, 2020; Zhou et al., 2021)." and "In this work, we consider ϵ0 to be 5% of the performance of f(x; θ, T)." and "For evaluation, similar to Zheng et al. (2023), we propose to use GPT-4 as a judge, which compares the compressed LLM generated summaries w.r.t. GPT-3.5 (text-davinci-003) generated summaries." and "Figure 7 illustrates the zero-shot performance of 50% & 70% pruned Vicuna-7B using Wanda and SparseGPT on the knowledge-intensive MMLU benchmark. It is interesting to observe that calibration sample count plays a vital role in preserving the performance of SparseGPT, unlike Wanda."
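The quoted setup distinguishes unstructured sparsity from structured N:M sparsity. The snippet below is a minimal PyTorch sketch of the two mask patterns, using plain magnitude scores as a stand-in for the calibration-aware criteria of SparseGPT and Wanda; the function names and the 2:4 default are illustrative and not taken from the paper.

    # Minimal sketch of the two sparsity patterns described in the paper's setup,
    # scored by weight magnitude only (SparseGPT/Wanda use calibration-aware scores).
    import torch

    def unstructured_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
        """Keep weights whose magnitude exceeds the k-th smallest; ~`sparsity` zeroed."""
        k = int(weight.numel() * sparsity)
        if k == 0:
            return torch.ones_like(weight, dtype=torch.bool)
        threshold = weight.abs().flatten().kthvalue(k).values
        return weight.abs() > threshold

    def n_m_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
        """Keep the `n` largest-magnitude weights in every contiguous group of `m`."""
        out_features, in_features = weight.shape
        assert in_features % m == 0, "input dimension must be divisible by M"
        groups = weight.abs().reshape(out_features, in_features // m, m)
        topk = groups.topk(n, dim=-1).indices
        mask = torch.zeros_like(groups).scatter_(-1, topk, 1.0).bool()
        return mask.reshape(out_features, in_features)

    # Example: apply 50% unstructured sparsity and 2:4 structured sparsity to one layer.
    layer = torch.nn.Linear(128, 64, bias=False)
    with torch.no_grad():
        w = layer.weight
        pruned_unstructured = w * unstructured_mask(w, sparsity=0.5)
        pruned_2_4 = w * n_m_mask(w, n=2, m=4)
    print(pruned_unstructured.eq(0).float().mean().item(),  # ~0.5
          pruned_2_4.eq(0).float().mean().item())           # exactly 0.5 for 2:4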