LLM-Pruner: On the Structural Pruning of Large Language Models

Authors: Xinyin Ma, Gongfan Fang, Xinchao Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of LLM-Pruner, we conduct extensive experiments on three large language models: LLaMA-7B, Vicuna-7B, and ChatGLM-6B. The compressed models are evaluated on nine datasets to assess both the generation quality and the zero-shot classification performance of the pruned models. The experimental results demonstrate that even with the removal of 20% of the parameters, the pruned model maintains 94.97% of the performance of the original model.
Researcher Affiliation | Academia | Xinyin Ma, Gongfan Fang, Xinchao Wang; National University of Singapore; maxinyin@u.nus.edu, gongfan@u.nus.edu, xinchao@nus.edu.sg
Pseudocode | No | The paper describes the steps of the LLM-Pruner method in text but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at: https://github.com/horseee/LLM-Pruner
Open Datasets | Yes | Evaluation and Datasets. To assess the performance of the model in the task-agnostic setting, we follow LLaMA's evaluation to perform zero-shot task classification on common sense reasoning datasets: BoolQ [6], PIQA [2], HellaSwag [73], WinoGrande [43], ARC-easy [7], ARC-challenge [7] and OpenbookQA [38]. Following [14], the model ranks the choices in the multiple-choice tasks or generates the answer in the open-ended generation. Additionally, we complement our evaluation with a zero-shot perplexity (PPL) analysis on WikiText2 [37] and PTB [35]. ... During the recovery phase, we utilize the cleaned version of Alpaca [49], which comprises approximately 50k samples. ... Consequently, we conduct an experiment aimed at model recovery with more data, employing a dataset comprising 2.59 million samples [59]. (See the dataset-loading sketch after this table.)
Dataset Splits | No | The paper focuses on zero-shot evaluation and fine-tuning with limited data (50k Alpaca samples). While it uses data for calibration and recovery, it does not provide explicit train/validation/test splits with percentages or sample counts for the datasets used in the main evaluation (e.g., BoolQ, PIQA, etc.).
Hardware Specification | Yes | We run our experiment on a single GPU with 24GB memory, taking approximately 2.5 hours if an RTX 4090 is utilized. ... The latency is tested on the test set of WikiText2 on a single A5000. (See the latency-timing sketch after this table.)
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We set the rank d to 8 in our experiment. The learning rate is set to 1e-4 with 100 warmup steps. The batch size for training is selected from {64, 128}, and the AdamW optimizer is employed in our experiment. The best training length we found is 2 epochs, as training for more epochs even has a negative impact on model performance. (See the recovery fine-tuning sketch after this table.)
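
The evaluation and recovery datasets quoted in the Open Datasets row are all publicly available. Below is a minimal sketch of how they could be pulled from the Hugging Face Hub; the dataset and configuration identifiers (e.g., "winogrande_xl", "wikitext-2-raw-v1", "yahma/alpaca-cleaned") are assumptions about common Hub naming, not identifiers confirmed by the paper.

```python
# Sketch: loading the zero-shot evaluation sets and the Alpaca recovery data.
# Dataset/config names are assumed Hugging Face Hub identifiers, not taken from the paper.
from datasets import load_dataset

zero_shot_tasks = {
    "boolq": ("boolq", None),
    "piqa": ("piqa", None),
    "hellaswag": ("hellaswag", None),
    "winogrande": ("winogrande", "winogrande_xl"),
    "arc_easy": ("ai2_arc", "ARC-Easy"),
    "arc_challenge": ("ai2_arc", "ARC-Challenge"),
    "openbookqa": ("openbookqa", "main"),
}

# Zero-shot classification benchmarks, loaded on their validation splits.
eval_sets = {
    name: load_dataset(path, config, split="validation")
    for name, (path, config) in zero_shot_tasks.items()
}

# Perplexity corpora.
wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ptb = load_dataset("ptb_text_only", "penn_treebank", split="test")

# Cleaned Alpaca (~50k samples) used in the recovery phase.
alpaca_cleaned = load_dataset("yahma/alpaca-cleaned", split="train")
```

Note that this only covers data access; the paper's scoring protocol (ranking answer choices for multiple-choice tasks, generating answers for open-ended ones, and computing perplexity) is a separate step.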
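The Hardware Specification row reports latency measured on the WikiText2 test set on a single A5000. A generic sketch of how such a measurement could be timed on one GPU is shown below; the checkpoint path, sequence length, and iteration counts are illustrative assumptions, not values from the paper.

```python
# Sketch: timing the forward pass on WikiText2 test text on a single GPU.
# The model path, sequence length, and repeat counts are illustrative assumptions.
import time
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/pruned-or-original-llama-7b"  # placeholder, not from the paper
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16
).to(device).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids[:, :128].to(device)

with torch.no_grad():
    for _ in range(3):          # warm-up iterations
        model(input_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):         # timed iterations
        model(input_ids)
    torch.cuda.synchronize()
    print(f"avg forward latency: {(time.perf_counter() - start) / 10 * 1000:.1f} ms")
```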
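The hyperparameters in the Experiment Setup row (rank 8, learning rate 1e-4, 100 warmup steps, batch size 64 or 128, AdamW, 2 epochs) map naturally onto a LoRA recovery fine-tuning configuration. The sketch below uses the peft and transformers libraries; the lora_alpha, lora_dropout, and target_modules values are assumptions, and this is not the authors' exact recovery script.

```python
# Sketch: LoRA-based recovery fine-tuning with the hyperparameters quoted above.
# lora_alpha, lora_dropout, and target_modules are assumptions not stated in the paper.
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,                       # rank d = 8 (from the paper)
    lora_alpha=16,             # assumed
    lora_dropout=0.05,         # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed LLaMA projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="recovery_out",
    learning_rate=1e-4,              # from the paper
    warmup_steps=100,                # from the paper
    per_device_train_batch_size=64,  # selected from {64, 128} in the paper
    num_train_epochs=2,              # best epoch count reported in the paper
    optim="adamw_torch",             # AdamW optimizer
    fp16=True,                       # assumed mixed-precision setting
)

# model = get_peft_model(pruned_model, lora_config)  # pruned_model: the structurally pruned LLM
# Trainer(model=model, args=training_args, train_dataset=alpaca_cleaned, ...).train()
```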