Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models
Authors: Lujun Li, Peijie Dong, Zhenheng Tang, Xiang Liu, Qiang Wang, Wenhan Luo, Wei Xue, Qifeng Liu, Xiaowen Chu, Yike Guo
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on multiple challenging tasks such as arithmetic, knowledge reasoning, and multimodal benchmarks spanning GSM8K, MMLU, SQA, and VQA, demonstrating that our DSA method achieves significant performance gains on the LLaMA-1/2/3, Mistral, and OPT models. |
| Researcher Affiliation | Academia | 1Hong Kong University of Science and Technology 2Hong Kong University of Science and Technology (Guangzhou) 3Hong Kong Baptist University 4Harbin Institute of Technology (Shenzhen) |
| Pseudocode | Yes | Algorithm 1 Evolutionary Search for Allocation Function Discovery |
| Open Source Code | Yes | Codes at: https://github.com/lliai/DSA |
| Open Datasets | Yes | We employ a set of seven tasks sourced from the EleutherAI LM Harness [50]... GSM8K [8] and MMLU [22] datasets... VQAv2 [17], SQA [37], and VQA [47]. |
| Dataset Splits | Yes | This involves computing the sparsity ratios by applying the candidate function to the sparsity metric, evaluating the pruned model on a validation set using a performance metric, and checking if the pruned model's size satisfies the given constraint... we allocate 20% of the original dataset's training set as a held-out test set for the search process. We meticulously confirm that these validation datasets do not overlap with the test set, preventing any potential data leakage or bias in our evaluations. |
| Hardware Specification | Yes | In this way, we search our allocation function in only 0.5 days on a single NVIDIA H800 GPU server based on Wanda, using perplexity results from the validation set of LLaMA-1-7B on WikiText2 [41]. |
| Software Dependencies | No | The paper does not provide specific version numbers for key software components or libraries, only mentioning general tools like 'Wanda' and 'SparseGPT' without version details. |
| Experiment Setup | Yes | During the search phase, we configure the evolutionary algorithm (Algorithm 1) with a population size of 20, a maximum of 1,000 iterations, a sample ratio of 0.9, and a top-k value of 5. |
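The search procedure quoted above (Algorithm 1 with a population of 20, up to 1,000 iterations, a sample ratio of 0.9, and top-k of 5) can be sketched as an evolutionary loop. This is a minimal illustrative sketch, not the authors' code: the candidate encoding (a per-layer sparsity-ratio vector), the toy fitness function, and the mutation operator are all stand-in assumptions; the real method evaluates a pruned LLM on a validation set and enforces a size constraint.

```python
import random

# Reported search configuration (population, iterations, sample ratio, top-k).
POP_SIZE, MAX_ITERS, SAMPLE_RATIO, TOP_K = 20, 1_000, 0.9, 5


def evaluate(candidate):
    """Stand-in fitness. The paper evaluates a pruned model on a validation
    set and checks a size constraint; here we just penalize deviation of the
    mean per-layer sparsity from an illustrative 50% overall budget."""
    target = 0.5  # hypothetical overall sparsity constraint
    ratios = [min(max(r, 0.0), 1.0) for r in candidate]
    budget_gap = abs(sum(ratios) / len(ratios) - target)
    return -budget_gap  # higher is better


def mutate(candidate):
    """Perturb one per-layer ratio with small Gaussian noise (assumed operator)."""
    child = list(candidate)
    i = random.randrange(len(child))
    child[i] += random.gauss(0.0, 0.05)
    return child


def evolutionary_search(n_layers=32, iters=MAX_ITERS):
    # Initialize a random population of per-layer sparsity-ratio vectors.
    population = [[random.random() for _ in range(n_layers)]
                  for _ in range(POP_SIZE)]
    for _ in range(iters):
        # Sample a fraction of the population and keep the top-k as parents.
        sample = random.sample(population, int(SAMPLE_RATIO * len(population)))
        parents = sorted(sample, key=evaluate, reverse=True)[:TOP_K]
        # Replace the worst member with a mutated copy of a random parent.
        child = mutate(random.choice(parents))
        population.sort(key=evaluate)
        population[0] = child
    return max(population, key=evaluate)


if __name__ == "__main__":
    best = evolutionary_search(n_layers=8, iters=200)
```

In the paper, `evaluate` would prune the model with the candidate allocation function, score it (e.g., perplexity on the WikiText2 validation split), and reject candidates that violate the size constraint.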