SlimGPT: Layer-wise Structured Pruning for Large Language Models
Authors: Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the LLaMA benchmark show that SlimGPT outperforms other methods and achieves state-of-the-art results. |
| Researcher Affiliation | Industry | Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu Alibaba Group {linggui.lg, shanyi.wzy, yuliang.yyl, xiangsheng.lqw}@alibaba-inc.com |
| Pseudocode | Yes | Algorithm 1 Batched Greedy Pruning for Attention Heads |
| Open Source Code | Yes | We submit our source code as an anonymized zip file, accompanied by a detailed replication guide. |
| Open Datasets | Yes | We use the C4 dataset [30] as the calibration set... The language modeling performance is evaluated on the WikiText2 [32] validation set... and the commonsense reasoning capabilities are evaluated under a zero-shot setting on the Commonsense Reasoning datasets, which encompass seven diverse subtasks: BoolQ [33], PIQA [34], HellaSwag [35], WinoGrande [36], ARC-easy [37], ARC-challenge [37], and OpenBookQA [38]. |
| Dataset Splits | Yes | From the first shard of C4, we randomly select 256 sequences of 2048 tokens for pruning. ... The language modeling performance is evaluated on the WikiText2 [32] validation set with sequence length truncated to 128, and the commonsense reasoning capabilities are evaluated under a zero-shot setting on the Commonsense Reasoning datasets (a hedged calibration-sampling sketch follows the table). |
| Hardware Specification | Yes | All pruning experiments are conducted on a single A100, while finetuning is performed using two A100s. |
| Software Dependencies | No | We utilize the lm-eval-harness framework [39] to conduct these evaluations. The paper does not specify version numbers for this or any other software dependencies (an illustrative evaluation call follows the table). |
| Experiment Setup | Yes | We tune with the Alpaca dataset [31] for one epoch and utilize the AdamW optimizer with an initial learning rate set to 1e-4, coupled with a cosine annealing schedule for the learning rate. The global batch size is set to 64 and the sequence length is truncated to 256 (an illustrative finetuning configuration follows the table). |
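
The calibration-set construction quoted in the Dataset Splits row (256 random windows of 2048 tokens drawn from the first C4 shard) can be sketched as follows. This is a minimal illustration rather than the authors' released code: the shard filename, the tokenizer/model name, and the random seed are assumptions.

```python
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def sample_c4_calibration(model_name="huggyllama/llama-7b", nsamples=256, seqlen=2048, seed=0):
    """Draw `nsamples` random windows of `seqlen` tokens from the first C4 training shard."""
    # First shard of the English C4 training split (shard name is an assumption).
    data = load_dataset(
        "allenai/c4",
        data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
        split="train",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    random.seed(seed)

    samples = []
    while len(samples) < nsamples:
        doc = data[random.randint(0, len(data) - 1)]["text"]
        ids = tokenizer(doc, return_tensors="pt").input_ids
        if ids.shape[1] <= seqlen:
            continue  # skip documents shorter than one full window
        start = random.randint(0, ids.shape[1] - seqlen - 1)
        samples.append(ids[:, start:start + seqlen])
    return torch.cat(samples, dim=0)  # shape: (nsamples, seqlen)
```

In the paper, these 256 sequences serve as the calibration set for layer-wise pruning on a single A100.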
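The zero-shot commonsense-reasoning evaluation can be run with lm-eval-harness. Since the paper does not pin a version, the snippet below assumes a recent v0.4-style Python API, and the checkpoint path is a placeholder.

```python
from lm_eval import simple_evaluate

# Zero-shot evaluation on the seven commonsense reasoning subtasks listed above.
# `pretrained=...` should point at the pruned checkpoint (path is hypothetical).
results = simple_evaluate(
    model="hf",
    model_args="pretrained=./slimgpt-pruned-llama-7b",
    tasks=["boolq", "piqa", "hellaswag", "winogrande", "arc_easy", "arc_challenge", "openbookqa"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```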
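The recovery-finetuning hyperparameters in the Experiment Setup row map naturally onto HuggingFace `TrainingArguments`. The paper does not state which training framework was used, so the following is only an illustrative configuration; the per-device batch size and gradient-accumulation split of the global batch of 64 across two A100s, as well as the precision, are assumptions.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported finetuning setup: one epoch on Alpaca,
# AdamW, initial LR 1e-4 with cosine annealing, global batch size 64.
training_args = TrainingArguments(
    output_dir="./slimgpt-alpaca-finetune",   # hypothetical output path
    num_train_epochs=1,
    optim="adamw_torch",                      # AdamW optimizer
    learning_rate=1e-4,
    lr_scheduler_type="cosine",               # cosine annealing schedule
    per_device_train_batch_size=8,            # assumption: 8 * 2 GPUs * 4 accumulation steps = 64
    gradient_accumulation_steps=4,
    bf16=True,                                # assumption; precision is not reported
)
# The 256-token sequence truncation would be applied during tokenization,
# e.g. tokenizer(..., truncation=True, max_length=256), not in these arguments.
```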