Compressing LLMs: The Truth is Rarely Pure and Never Simple

Authors: Ajay Kumar Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks to redefine the evaluation protocol for compressed LLMs, which have significant alignment with their dense counterparts, and perplexity fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity in knowledge-intensive tasks; current quantization methods are more successful than pruning; yet, pruned LLMs even at 50% sparsity are robust in-context retrieval and summarization systems; among others.
Researcher Affiliation | Collaboration | Ajay Jaiswal¹, Zhe Gan², Xianzhi Du², Bowen Zhang², Zhangyang Wang¹, Yinfei Yang² (¹University of Texas at Austin, ²Apple)
Pseudocode | No | The paper does not include any figures, blocks, or sections labeled "Pseudocode" or "Algorithm", nor does it present structured steps formatted like code or an algorithm.
Open Source Code | Yes | The reproduced code is available at https://github.com/VITA-Group/llm-kick.
Open Datasets | Yes | "We use Freebase QA (Jiang et al., 2019), which is a dataset for open-domain QA over the Freebase knowledge graph." and "We use the popular MMLU (Massive Multitask Language Understanding) benchmark, which covers 50+ subjects across STEM, Humanities, Social Sciences, and more (Hendrycks et al., 2020)." and Table 1: "Freebase QA (Jiang et al., 2019) https://huggingface.co/datasets/freebase_qa" and "MMLU Benchmark (Hendrycks et al., 2020) https://huggingface.co/datasets/freebase_qa".
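As a reproducibility aid, the snippet below is a minimal sketch of fetching the two quoted benchmarks from the Hugging Face Hub with the datasets library. The "cais/mmlu" identifier and its "all" config are assumptions (the paper's Table 1 entry for MMLU appears to duplicate the FreebaseQA URL), and the exact dataset versions the authors used are not stated.

    # Minimal sketch (not from the paper): pulling the benchmarks quoted above with the
    # Hugging Face `datasets` library. Dataset IDs other than the "freebase_qa" link
    # cited in Table 1 are assumptions and may need adjusting.
    from datasets import load_dataset

    # FreebaseQA (Jiang et al., 2019): open-domain QA over the Freebase knowledge graph.
    # This is an older script-based dataset; it may require an older `datasets` release.
    freebase_qa = load_dataset("freebase_qa")
    print(freebase_qa)

    # MMLU (Hendrycks et al., 2020): 50+ subjects across STEM, humanities, social sciences.
    # "cais/mmlu" with the "all" config is one common hosting of the benchmark (assumption).
    mmlu = load_dataset("cais/mmlu", "all", split="test")
    print(mmlu[0])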
Dataset Splits | No | The paper references various datasets (e.g., Freebase QA, MMLU, Trivia QA, CNN/Daily Mail, MT-Bench) but does not explicitly state the training, validation, or test splits (e.g., percentages, sample counts, or explicit use of standard splits) needed to reproduce the experiments. For CNN/Daily Mail, it mentions creating subsets based on story length, but this is not a train/validation/test split.
Hardware Specification | No | The paper mentions hardware requirements for LLMs in general (e.g., A100 GPUs for GPT-175B) but does not provide specific details about the hardware used to conduct its own experiments (e.g., exact GPU models, CPU types, or memory specifications).
Software Dependencies | No | The paper mentions various models and methods (e.g., Vicuna, GPTQ, SparseGPT, Wanda) and tools (FastChat, LLM-Judge), and cites general frameworks like PyTorch. However, it does not provide specific version numbers for key software dependencies such as Python, PyTorch, or other libraries used in the experimental setup.
Experiment Setup | Yes | "More specifically, we consider a broad range of tasks to evaluate subtle changes in pruned and quantized LLMs' ability for language understanding, reasoning, generation, in-context retrieval, long-context summarization, etc." and "We consider two types of sparsities: (i) Unstructured Sparsity: individual model weights are zeroed out independently, leading to irregular zero patterns (LeCun et al., 1990; Han et al., 2016); and (ii) Structured N:M Sparsity: a fine-grained sparsity pattern in which only N weights are non-zero for every continuous M weights (Nvidia, 2020; Zhou et al., 2021)." and "In this work, we consider ϵ0 to be 5% of the performance of f(x; θ, T)." and "For evaluation, similar to Zheng et al. (2023), we propose to use GPT-4 as a judge, which compares the compressed LLM generated summaries w.r.t. GPT-3.5 (text-davinci-003) generated summaries." and "Figure 7 illustrates the zero-shot performance of 50% & 70% pruned Vicuna-7B using Wanda and SparseGPT on the knowledge-intensive MMLU benchmark. It is interesting to observe that calibration sample count plays a vital role in preserving the performance of SparseGPT, unlike Wanda."
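The quoted setup distinguishes unstructured sparsity from structured N:M sparsity. The snippet below is a minimal PyTorch sketch of the two mask patterns, using plain magnitude scores as a stand-in for the calibration-aware criteria of SparseGPT and Wanda; the function names and the 2:4 default are illustrative and not taken from the paper.

    # Minimal sketch of the two sparsity patterns described in the paper's setup,
    # scored by weight magnitude only (SparseGPT/Wanda use calibration-aware scores).
    import torch

    def unstructured_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
        """Keep weights whose magnitude exceeds the k-th smallest; ~`sparsity` zeroed."""
        k = int(weight.numel() * sparsity)
        if k == 0:
            return torch.ones_like(weight, dtype=torch.bool)
        threshold = weight.abs().flatten().kthvalue(k).values
        return weight.abs() > threshold

    def n_m_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
        """Keep the `n` largest-magnitude weights in every contiguous group of `m`."""
        out_features, in_features = weight.shape
        assert in_features % m == 0, "input dimension must be divisible by M"
        groups = weight.abs().reshape(out_features, in_features // m, m)
        topk = groups.topk(n, dim=-1).indices
        mask = torch.zeros_like(groups).scatter_(-1, topk, 1.0).bool()
        return mask.reshape(out_features, in_features)

    # Example: apply 50% unstructured sparsity and 2:4 structured sparsity to one layer.
    layer = torch.nn.Linear(128, 64, bias=False)
    with torch.no_grad():
        w = layer.weight
        pruned_unstructured = w * unstructured_mask(w, sparsity=0.5)
        pruned_2_4 = w * n_m_mask(w, n=2, m=4)
    print(pruned_unstructured.eq(0).float().mean().item(),  # ~0.5
          pruned_2_4.eq(0).float().mean().item())           # exactly 0.5 for 2:4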