Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Authors: Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. |
| Researcher Affiliation | Academia | Mengzhou Xia¹, Tianyu Gao¹, Zhiyuan Zeng², Danqi Chen¹; ¹Princeton Language and Intelligence, Princeton University; ²Department of Computer Science and Technology, Tsinghua University |
| Pseudocode | Yes | Algorithm 1: Dynamic Batch Loading |
| Open Source Code | Yes | Please find our code and models at https://github.com/princeton-nlp/LLM-Shearing. |
| Open Datasets | Yes | As the training data for LLaMA2 is not publicly accessible, we use RedPajama (Together AI, 2023b), which is a replicated pre-training dataset of the LLaMA1 models (Touvron et al., 2023a), for pruning and continued-pretraining. |
| Dataset Splits | Yes | We construct a held-out validation set with 2 million tokens (equivalent to 500 sequences of 4,096 tokens) for each domain. |
| Hardware Specification | Yes | Inference speed is measured on a Nvidia A100 (80G) GPU, on a single instance generating up to 512 tokens. ... We performed an inference speed analysis comparing LLMpruner and Sheared-LLaMA’s model architectures using a single A100 GPU to generate up to 2048 tokens. ... Measured with 16 A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions software like 'fully sharded data parallel', 'Flash Attention V1', and 'Composer', but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We present the hyperparameters used in our experiments in Appendix C. ... Table 7: Training hyper-parameters and throughput (values given per training budget, 0.4B / 50B). Learning rate of z, ϕ, λ: 1.0 (one value listed); learning rate of θ: 0.0001 / 0.0001; LR warmup ratio: 10% / 3%; batch size (tokens): 131K / 1M; evaluation interval m (steps): 50 / 400; steps: 3,200 / 51,200; # GPUs: 8 / 16; throughput (tokens/s): 15K / 145K (1.3B), 77K (2.7B). |
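
The token counts quoted in the Dataset Splits and Experiment Setup rows can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not taken from the paper or its code: it assumes "131K" and "1M" denote the binary values 131,072 and 1,048,576 tokens (32 and 256 sequences of 4,096 tokens), which the excerpts do not state, and the variable and function names are hypothetical.

```python
# Sanity check of the token counts quoted in the table above.
# Assumption: "131K" and "1M" are read as powers of two; the excerpt
# does not say whether decimal or binary units are meant.

SEQ_LEN = 4_096  # tokens per sequence, from the Dataset Splits row


def tokens_seen(steps: int, batch_tokens: int) -> int:
    """Total tokens processed after `steps` updates of `batch_tokens` each."""
    return steps * batch_tokens


# Held-out validation set: 500 sequences of 4,096 tokens per domain.
validation_tokens = 500 * SEQ_LEN

# Smaller-budget configuration: 3,200 steps at a 131K-token batch.
pruning_tokens = tokens_seen(3_200, 32 * SEQ_LEN)

# Larger-budget configuration: 51,200 steps at a 1M-token batch.
continued_tokens = tokens_seen(51_200, 256 * SEQ_LEN)

print(f"validation tokens per domain: {validation_tokens:,}")
print(f"smaller-budget tokens: {pruning_tokens / 1e9:.2f}B")
print(f"larger-budget tokens:  {continued_tokens / 1e9:.2f}B")
```

Under these assumptions the validation set comes to about 2 million tokens per domain and the smaller configuration to roughly 0.4B tokens, in line with the quoted figures. The larger configuration lands somewhat above the nominal 50B budget, so the quoted budget is best read as approximate, or "1M" as a decimal million.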