Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
SlimGPT: Layer-wise Structured Pruning for Large Language Models
Authors: Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the LLaMA benchmark show that SlimGPT outperforms other methods and achieves state-of-the-art results. |
| Researcher Affiliation | Industry | Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu Alibaba Group EMAIL |
| Pseudocode | Yes | Algorithm 1 Batched Greedy Pruning for Attention Heads |
| Open Source Code | Yes | We submit our source code as an anonymized zip file, accompanied by a detailed replication guide. |
| Open Datasets | Yes | We use C4 dataset [30] as the calibration set... The language modeling performance is evaluated on the WikiText2 [32] validation set... and the commonsense reasoning capabilities is carried out under a zero-shot setting on the Commonsense Reasoning datasets, which encompass seven diverse subtasks: BoolQ [33], PIQA [34], HellaSwag [35], WinoGrande [36], ARC-easy [37], ARC-challenge [37], and OpenbookQA [38]. |
| Dataset Splits | Yes | From the first shard of C4, we randomly select 256 2048-token sequences for pruning. ... The language modeling performance is evaluated on the WikiText2 [32] validation set with sequence length truncated to 128, and the commonsense reasoning capabilities is carried out under a zero-shot setting on the Commonsense Reasoning datasets |
| Hardware Specification | Yes | All pruning experiments are conducted on a single A100, while finetuning is performed using two A100s. |
| Software Dependencies | No | We utilize the lm-eval-harness framework [39] to conduct these evaluations. The paper does not specify version numbers for this or any other software dependencies. |
| Experiment Setup | Yes | We tune with Alpaca datasets [31] for one epoch and utilize the AdamW optimizer with an initial learning rate set to 1e-4, coupled with a cosine annealing schedule for the learning rate. The global batch size is set to 64 and the sequence length is truncated to 256. |
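
The Experiment Setup row reports an initial learning rate of 1e-4 with cosine annealing and a global batch size of 64. The following is a minimal sketch of how such a schedule could be computed, not the authors' implementation. It assumes the standard cosine annealing formula, no warmup, and a final learning rate of 0, none of which is stated in the quoted text. The function names and the `num_examples` parameter are hypothetical.

```python
import math


def steps_per_epoch(num_examples, global_batch_size=64):
    """Number of optimizer steps in one epoch (last partial batch counted)."""
    return math.ceil(num_examples / global_batch_size)


def cosine_annealed_lr(step, total_steps, initial_lr=1e-4, min_lr=0.0):
    """Learning rate at `step` under standard cosine annealing.

    Assumptions (not from the paper): no warmup, and decay to min_lr=0.0.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step, 0), total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (initial_lr - min_lr) * cosine
```

For a one-epoch run, `total_steps` would be `steps_per_epoch(num_examples)`, where `num_examples` is the size of the finetuning set actually used.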