Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SlimGPT: Layer-wise Structured Pruning for Large Language Models

Authors: Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the LLaMA benchmark show that SlimGPT outperforms other methods and achieves state-of-the-art results.
Researcher Affiliation | Industry | Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu Alibaba Group EMAIL
Pseudocode | Yes | Algorithm 1 Batched Greedy Pruning for Attention Heads
Open Source Code | Yes | We submit our source code as an anonymized zip file, accompanied by a detailed replication guide.
Open Datasets | Yes | We use C4 dataset [30] as the calibration set... The language modeling performance is evaluated on the WikiText2 [32] validation set... and the commonsense reasoning capabilities is carried out under a zero-shot setting on the Commonsense Reasoning datasets, which encompass seven diverse subtasks: BoolQ [33], PIQA [34], HellaSwag [35], WinoGrande [36], ARC-easy [37], ARC-challenge [37], and OpenbookQA [38].
Dataset Splits | Yes | From the first shard of C4, we randomly select 256 2048-token sequences for pruning. ... The language modeling performance is evaluated on the WikiText2 [32] validation set with sequence length truncated to 128, and the commonsense reasoning capabilities is carried out under a zero-shot setting on the Commonsense Reasoning datasets
Hardware Specification | Yes | All pruning experiments are conducted on a single A100, while finetuning is performed using two A100s.
Software Dependencies | No | We utilize the lm-eval-harness framework [39] to conduct these evaluations. The paper does not specify version numbers for this or any other software dependencies.
Experiment Setup | Yes | We tune with Alpaca datasets [31] for one epoch and utilize the AdamW optimizer with an initial learning rate set to 1e-4, coupled with a cosine annealing schedule for the learning rate. The global batch size is set to 64 and the sequence length is truncated to 256.
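
The calibration sampling and learning-rate schedule quoted above can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' code. The numeric constants (256 sequences of 2048 tokens, initial learning rate 1e-4, global batch size 64) come from the quoted text; everything else is an assumption made for illustration: the helper names, sampling windows with replacement from one concatenated token stream, annealing to a minimum learning rate of 0 with no warmup, and the toy inputs.

```python
import math
import random

# Values quoted in the table above.
NUM_CALIBRATION_SAMPLES = 256
CALIBRATION_SEQ_LEN = 2048
INITIAL_LR = 1e-4
GLOBAL_BATCH_SIZE = 64


def sample_calibration_windows(token_ids, num_samples, seq_len, seed=0):
    """Draw random fixed-length windows from one long token stream.

    Assumption: windows are drawn with replacement at uniformly random
    offsets. The source only says the sequences are selected randomly.
    """
    if len(token_ids) < seq_len:
        raise ValueError("token stream is shorter than one window")
    rng = random.Random(seed)
    last_start = len(token_ids) - seq_len
    starts = [rng.randint(0, last_start) for _ in range(num_samples)]
    return [token_ids[s:s + seq_len] for s in starts]


def cosine_lr(step, total_steps, initial_lr=INITIAL_LR, min_lr=0.0):
    """Cosine annealing from initial_lr (step 0) to min_lr (total_steps).

    Assumption: no warmup and min_lr = 0; the source states neither.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (initial_lr - min_lr) * cosine


def steps_per_epoch(num_examples, global_batch_size=GLOBAL_BATCH_SIZE):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / global_batch_size)


if __name__ == "__main__":
    # Toy stand-in for a tokenized C4 shard (hypothetical data).
    toy_tokens = list(range(100_000))
    windows = sample_calibration_windows(
        toy_tokens, NUM_CALIBRATION_SAMPLES, CALIBRATION_SEQ_LEN
    )
    print(len(windows), len(windows[0]))

    # Hypothetical dataset size, chosen only to exercise the schedule.
    total = steps_per_epoch(6_400)
    for step in (0, total // 2, total):
        print(step, cosine_lr(step, total))
```

In a real training run the schedule would normally come from the training framework rather than a hand-written function; the sketch only shows the shape of the decay implied by "cosine annealing" over a single epoch.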