LLM-Pruner: On the Structural Pruning of Large Language Models
Authors: Xinyin Ma, Gongfan Fang, Xinchao Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of LLM-Pruner, we conduct extensive experiments on three large language models: LLaMA-7B, Vicuna-7B, and ChatGLM-6B. The compressed models are evaluated using nine datasets to assess both the generation quality and the zero-shot classification performance of the pruned models. The experimental results demonstrate that even with the removal of 20% of the parameters, the pruned model maintains 94.97% of the performance of the original model. |
| Researcher Affiliation | Academia | Xinyin Ma, Gongfan Fang, Xinchao Wang; National University of Singapore; maxinyin@u.nus.edu, gongfan@u.nus.edu, xinchao@nus.edu.sg |
| Pseudocode | No | The paper describes the steps of the LLM-Pruner method in text but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at: https://github.com/horseee/LLM-Pruner |
| Open Datasets | Yes | Evaluation and Datasets. To assess the performance of the model in the task-agnostic setting, we follow LLaMA's evaluation to perform zero-shot task classification on common sense reasoning datasets: BoolQ [6], PIQA [2], HellaSwag [73], WinoGrande [43], ARC-easy [7], ARC-challenge [7] and OpenbookQA [38]. Following [14], the model ranks the choices in the multiple-choice tasks or generates the answer in the open-ended generation. Additionally, we complement our evaluation with a zero-shot perplexity (PPL) analysis on WikiText2 [37] and PTB [35]. ... During the recovery phase, we utilize the cleaned version of Alpaca [49], which comprises approximately 50k samples. ... Consequently, we conduct an experiment aimed at model recovery with more data, employing a dataset comprising 2.59 million samples [59]. |
| Dataset Splits | No | The paper focuses on zero-shot evaluation and fine-tuning with limited data (50k Alpaca samples). While it uses data for calibration and recovery, it does not provide explicit train/validation/test splits with percentages or sample counts for the datasets used in the main evaluation (e.g., BoolQ, PIQA, etc.). |
| Hardware Specification | Yes | We run our experiment on a single GPU with 24GB memory, using approximately 2.5 hours if an RTX4090 is utilized. ... The latency is tested under the test set of WikiText2 on a single A5000. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We set the rank d to 8 in our experiment. The learning rate is set to 1e-4 with 100 warming steps. The batch size of training is selected from {64, 128} and the AdamW optimizer is employed in our experiment. The best training epoch we found is 2 epochs, as training with more epochs even has a negative impact on the model performance. |
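
The hyperparameters quoted in the Experiment Setup row (LoRA rank 8, learning rate 1e-4, 100 warm-up steps, batch size selected from {64, 128}, AdamW, 2 epochs) map naturally onto a LoRA recovery fine-tuning configuration. The sketch below only illustrates how those reported numbers could be wired together with the `peft` and `transformers` libraries; the checkpoint path, LoRA alpha, target modules, choice of batch size 64, and fp16 setting are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of the recovery-stage fine-tuning configuration.
# Values marked "reported" come from the Experiment Setup row; everything else is assumed.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("path/to/pruned-llama-7b")  # hypothetical checkpoint path

lora_config = LoraConfig(
    r=8,                                                       # reported: rank d = 8
    lora_alpha=16,                                             # assumed; not stated in the excerpt
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="lora-recovery",          # hypothetical output directory
    learning_rate=1e-4,                  # reported learning rate
    warmup_steps=100,                    # reported warm-up steps
    per_device_train_batch_size=64,      # reported selection is from {64, 128}
    num_train_epochs=2,                  # reported best epoch count
    optim="adamw_torch",                 # AdamW optimizer, as reported
    fp16=True,                           # assumed; typical for a single 24GB GPU
)

# `train_dataset` would be the cleaned Alpaca instruction data (~50k samples);
# its preparation (prompt formatting, tokenization) is omitted from this sketch.
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```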
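
The Open Datasets and Hardware rows refer to zero-shot perplexity evaluation of the pruned checkpoints on the WikiText2 test set. As a rough illustration of that protocol, here is a minimal sketch of windowed perplexity on WikiText-2 using the Hugging Face `transformers` and `datasets` libraries; the checkpoint path, the 2048-token window, and the use of these specific libraries are assumptions, not details confirmed by the paper.

```python
# Hedged sketch: zero-shot perplexity of a (pruned) causal LM on the WikiText-2 test split.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/pruned-llama-7b"   # hypothetical checkpoint path
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to(device)
model.eval()

# Concatenate the test split and score it in fixed-length, non-overlapping windows.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_len = stride = 2048                  # assumed window size
nlls, n_tokens = [], 0
for begin in range(0, encodings.input_ids.size(1) - 1, stride):
    end = min(begin + max_len, encodings.input_ids.size(1))
    input_ids = encodings.input_ids[:, begin:end].to(device)
    with torch.no_grad():
        # Labels equal the inputs; the model shifts them internally for next-token loss.
        loss = model(input_ids, labels=input_ids).loss
    nlls.append(loss * (end - begin - 1))  # mean loss -> summed NLL over predicted tokens
    n_tokens += end - begin - 1

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"WikiText-2 perplexity: {ppl.item():.2f}")
```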