SlimGPT: Layer-wise Structured Pruning for Large Language Models

Authors: Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the LLaMA benchmark show that SlimGPT outperforms other methods and achieves state-of-the-art results.
Researcher Affiliation | Industry | Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu; Alibaba Group; {linggui.lg, shanyi.wzy, yuliang.yyl, xiangsheng.lqw}@alibaba-inc.com
Pseudocode | Yes | Algorithm 1: Batched Greedy Pruning for Attention Heads (see the head-pruning sketch after the table).
Open Source Code | Yes | We submit our source code as an anonymized zip file, accompanied by a detailed replication guide.
Open Datasets | Yes | We use the C4 dataset [30] as the calibration set... The language modeling performance is evaluated on the WikiText2 [32] validation set... and the commonsense reasoning capabilities are evaluated under a zero-shot setting on the Commonsense Reasoning datasets, which encompass seven diverse subtasks: BoolQ [33], PIQA [34], HellaSwag [35], WinoGrande [36], ARC-easy [37], ARC-challenge [37], and OpenbookQA [38].
Dataset Splits | Yes | From the first shard of C4, we randomly select 256 2048-token sequences for pruning. ... The language modeling performance is evaluated on the WikiText2 [32] validation set with the sequence length truncated to 128, and the commonsense reasoning capabilities are evaluated under a zero-shot setting on the Commonsense Reasoning datasets (see the calibration-sampling sketch after the table).
Hardware Specification | Yes | All pruning experiments are conducted on a single A100, while finetuning is performed using two A100s.
Software Dependencies | No | We utilize the lm-eval-harness framework [39] to conduct these evaluations. The paper does not specify version numbers for this or any other software dependency (see the evaluation sketch after the table).
Experiment Setup | Yes | We tune with the Alpaca dataset [31] for one epoch and utilize the AdamW optimizer with an initial learning rate of 1e-4, coupled with a cosine annealing schedule for the learning rate. The global batch size is set to 64 and the sequence length is truncated to 256 (see the finetuning sketch after the table).
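
The paper's Algorithm 1 (Batched Greedy Pruning for Attention Heads) is only named in the table above. The snippet below is a minimal, hypothetical sketch of the general idea of batched greedy head pruning, assuming heads are ranked by a generic per-head importance score and removed a few at a time; it deliberately omits SlimGPT's specific scoring and error-compensation steps, which are given in the paper and its released code.

```python
# Hypothetical sketch of batched greedy pruning for attention heads.
# Not the paper's Algorithm 1: the importance score and the error
# compensation used by SlimGPT are replaced by a generic per-head score.
import torch

def batched_greedy_head_pruning(head_scores: torch.Tensor,
                                num_heads_to_prune: int,
                                batch_size: int = 4) -> list[int]:
    """Greedily select heads to prune, `batch_size` heads at a time.

    head_scores: tensor of shape (num_heads,), lower = less important
                 (assumed to be re-estimated after each pruning batch).
    Returns the indices of the pruned heads.
    """
    pruned: list[int] = []
    remaining = set(range(head_scores.numel()))
    while len(pruned) < num_heads_to_prune:
        k = min(batch_size, num_heads_to_prune - len(pruned))
        # Rank the remaining heads and take the k least important ones.
        candidates = sorted(remaining, key=lambda h: head_scores[h].item())[:k]
        pruned.extend(candidates)
        remaining.difference_update(candidates)
        # In the real algorithm, the layer weights would be updated here to
        # compensate the pruning error before re-scoring the remaining heads.
    return pruned
```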
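The calibration setup (256 randomly selected 2048-token sequences from the first shard of C4) can be approximated as follows. This is a sketch under assumptions: it uses the Hugging Face `datasets` library and a LLaMA tokenizer identifier, neither of which the paper pins down, and it samples random windows from individual documents rather than following the authors' exact sampling code.

```python
# Sketch: sample 256 calibration sequences of 2048 tokens from the first C4 shard.
# Tokenizer checkpoint and sampling details are assumptions, not the paper's code.
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def sample_c4_calibration(tokenizer_name: str = "huggyllama/llama-7b",  # assumed checkpoint
                          num_samples: int = 256,
                          seq_len: int = 2048,
                          seed: int = 0) -> torch.Tensor:
    random.seed(seed)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    # Load only the first training shard of the English C4 split.
    data = load_dataset(
        "allenai/c4",
        data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
        split="train",
    )
    samples = []
    while len(samples) < num_samples:
        doc = data[random.randint(0, len(data) - 1)]["text"]
        ids = tokenizer(doc, return_tensors="pt").input_ids
        if ids.shape[1] <= seq_len:
            continue  # skip documents shorter than one full window
        start = random.randint(0, ids.shape[1] - seq_len - 1)
        samples.append(ids[:, start:start + seq_len])
    return torch.cat(samples, dim=0)  # shape: (num_samples, seq_len)
```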
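Because the paper reports zero-shot accuracy via lm-eval-harness [39] without pinning a version, the invocation below is an assumption based on the v0.4-style Python API (the `hf` model type and task names can differ in older releases), and the model path is a placeholder.

```python
# Sketch: zero-shot evaluation with lm-eval-harness (v0.4-style API assumed).
from lm_eval.evaluator import simple_evaluate

results = simple_evaluate(
    model="hf",                                    # Hugging Face causal-LM backend
    model_args="pretrained=path/to/pruned-llama",  # placeholder path to the pruned model
    tasks=["boolq", "piqa", "hellaswag", "winogrande",
           "arc_easy", "arc_challenge", "openbookqa"],
    num_fewshot=0,                                 # zero-shot setting, as in the paper
)
print(results["results"])
```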
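The finetuning recipe in the Experiment Setup row maps onto a standard PyTorch / Hugging Face training loop. The sketch below only wires up the reported hyperparameters (AdamW at 1e-4 with cosine annealing, global batch size 64, sequence length 256, one epoch on Alpaca); the model identifier, the zero-warmup choice, and the training loop itself are assumptions not specified by the paper.

```python
# Sketch: optimizer and schedule wiring matching the reported finetuning setup.
# Model path, zero warmup, and the surrounding training loop are assumptions.
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("path/to/pruned-llama")  # placeholder

epochs = 1                # one epoch on the Alpaca dataset [31]
global_batch_size = 64    # as reported (use gradient accumulation if needed)
max_seq_len = 256         # sequences truncated to 256 tokens
steps_per_epoch = 52000 // global_batch_size  # Alpaca has roughly 52k samples

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,                    # warmup not specified in the paper
    num_training_steps=epochs * steps_per_epoch,
)

# Inside the training loop, after each gradient step:
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```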