Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models

Authors: Lujun Li, Peijie Dong, Zhenheng Tang, Xiang Liu, Qiang Wang, Wenhan Luo, Wei Xue, Qifeng Liu, Xiaowen Chu, Yike Guo

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on multiple challenging tasks such as arithmetic, knowledge reasoning, and multimodal benchmarks spanning GSM8K, MMLU, SQA, and VQA, demonstrating that our DSA method achieves significant performance gains on the LLaMA-1/2/3, Mistral, and OPT models.
Researcher Affiliation | Academia | 1 Hong Kong University of Science and Technology; 2 Hong Kong University of Science and Technology (Guangzhou); 3 Hong Kong Baptist University; 4 Harbin Institute of Technology (Shenzhen)
Pseudocode | Yes | Algorithm 1: Evolutionary Search for Allocation Function Discovery (a hedged sketch of this loop appears after the table).
Open Source Code | Yes | Codes at: https://github.com/lliai/DSA
Open Datasets | Yes | We employ a set of seven tasks sourced from the EleutherAI LM Harness [50]... GSM8K [8] and MMLU [22] datasets... VQAv2 [17], SQA [37], and VQA [47].
Dataset Splits | Yes | This involves computing the sparsity ratios by applying the candidate function to the sparsity metric, evaluating the pruned model on a validation set using a performance metric, and checking whether the pruned model's size satisfies the given constraint... we allocate 20% of the original dataset's training set as a held-out test set for the search process. We meticulously confirm that these validation datasets do not overlap with the test set, preventing any potential data leakage or bias in our evaluations. (A sketch of this fitness evaluation appears after the table.)
Hardware Specification | Yes | In this way, we search our allocation function in only 0.5 days on a single NVIDIA H800 GPU server, based on Wanda, using perplexity results from the validation set of LLaMA-1-7B on WikiText-2 [41].
Software Dependencies | No | The paper does not provide specific version numbers for key software components or libraries, only mentioning general tools like 'Wanda' and 'SparseGPT' without version details.
Experiment Setup | Yes | During the search phase, we configure the evolutionary algorithm (Algorithm 1) with a population size of 20, a maximum of 1,000 iterations, a sample ratio of 0.9, and a top-k value of 5.
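
The "Dataset Splits" row describes how each candidate allocation function is scored: apply it to per-layer sparsity metrics to get per-layer ratios, verify the pruned model's size satisfies the global constraint, and evaluate the pruned model on the validation set. Below is a minimal Python sketch of that fitness evaluation, assuming a perplexity objective; the names `layer_metrics`, `prune_and_eval_ppl`, `TARGET_SPARSITY`, and the renormalization step are illustrative assumptions, not the authors' actual code.

```python
# Hedged sketch of the fitness evaluation quoted in the "Dataset Splits" row.
# All names below (layer_metrics, prune_and_eval_ppl, TARGET_SPARSITY) are
# hypothetical stand-ins, not the paper's exact API.

import numpy as np

TARGET_SPARSITY = 0.5   # assumed global pruning budget (fraction of weights)
TOLERANCE = 0.01        # assumed slack allowed on the size constraint

def allocate(candidate_fn, layer_metrics, target=TARGET_SPARSITY):
    """Map per-layer sparsity metrics to per-layer ratios with a candidate
    allocation function, then rescale so the mean ratio matches the budget."""
    raw = np.array([candidate_fn(m) for m in layer_metrics])
    return np.clip(raw * target / max(raw.mean(), 1e-8), 0.0, 0.95)

def fitness(candidate_fn, layer_metrics, prune_and_eval_ppl):
    """Score a candidate: lower validation perplexity is better, and
    allocations violating the size constraint are rejected outright."""
    ratios = allocate(candidate_fn, layer_metrics)
    if abs(ratios.mean() - TARGET_SPARSITY) > TOLERANCE:  # size constraint
        return float("-inf")
    ppl = prune_and_eval_ppl(ratios)  # prune model, measure validation PPL
    return -ppl
```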
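The search itself (Algorithm 1, quoted in the "Pseudocode" and "Experiment Setup" rows) is an evolutionary loop configured with a population size of 20, 1,000 iterations, a sample ratio of 0.9, and top-k of 5. The sketch below wires those stated hyperparameters into a standard sample-and-mutate loop; the candidate encoding and the `random_candidate` and `mutate` operators are hypothetical, since the paper's exact operators are not quoted here.

```python
# Hedged sketch of the evolutionary search loop (Algorithm 1) using the
# hyperparameters quoted in the "Experiment Setup" row. The mutation operator
# and candidate encoding are assumptions, not the paper's exact design.

import random

POP_SIZE, MAX_ITERS, SAMPLE_RATIO, TOP_K = 20, 1000, 0.9, 5

def evolutionary_search(random_candidate, mutate, fitness):
    """Return the best allocation function found under the stated budget."""
    population = [random_candidate() for _ in range(POP_SIZE)]
    scores = [fitness(c) for c in population]
    for _ in range(MAX_ITERS):
        # Sample a fraction of the population and keep its top-k candidates.
        idx = random.sample(range(POP_SIZE), int(SAMPLE_RATIO * POP_SIZE))
        top = sorted(idx, key=lambda i: scores[i], reverse=True)[:TOP_K]
        # Mutate one of the top candidates to produce a child.
        child = mutate(population[random.choice(top)])
        child_score = fitness(child)
        # Replace the current worst member if the child improves on it.
        worst = min(range(POP_SIZE), key=lambda i: scores[i])
        if child_score > scores[worst]:
            population[worst], scores[worst] = child, child_score
    best = max(range(POP_SIZE), key=lambda i: scores[i])
    return population[best], scores[best]
```

The two sketches compose directly: pass `lambda c: fitness(c, layer_metrics, prune_and_eval_ppl)` from the first sketch as the `fitness` argument here.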