Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Authors: Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch.
Researcher Affiliation | Academia | Mengzhou Xia1, Tianyu Gao1, Zhiyuan Zeng2, Danqi Chen1; 1Princeton Language and Intelligence, Princeton University; 2Department of Computer Science and Technology, Tsinghua University; {mengzhou,tianyug,danqic}@cs.princeton.edu; zengzy20@mails.tsinghua.edu.cn
Pseudocode | Yes | Algorithm 1: Dynamic Batch Loading (a sketch of this update rule follows the table)
Open Source Code | Yes | Please find our code and models at https://github.com/princeton-nlp/LLM-Shearing.
Open Datasets | Yes | As the training data for LLaMA2 is not publicly accessible, we use RedPajama (Together AI, 2023b), which is a replicated pre-training dataset of the LLaMA1 models (Touvron et al., 2023a), for pruning and continued pre-training.
Dataset Splits | Yes | We construct a held-out validation set with 2 million tokens (equivalent to 500 sequences of 4,096 tokens) for each domain.
Hardware Specification | Yes | Inference speed is measured on an NVIDIA A100 (80GB) GPU, on a single instance generating up to 512 tokens. ... We performed an inference speed analysis comparing LLM-Pruner and Sheared-LLaMA's model architectures using a single A100 GPU to generate up to 2048 tokens. ... Measured with 16 A100 80GB GPUs. (A minimal timing sketch follows the table.)
Software Dependencies | No | The paper mentions software such as fully sharded data parallelism (FSDP), FlashAttention V1, and Composer, but does not provide version numbers for any of these dependencies.
Experiment Setup | Yes | We present the hyperparameters used in our experiments in Appendix C. ... Table 7: Training hyperparameters and throughput (pruning / continued pre-training): Training budget 0.4B / 50B tokens; Learning rate of z, ϕ, λ 1.0; Learning rate of θ 0.0001 / 0.0001; LR warmup ratio 10% / 3%; Batch size (tokens) 131K / 1M; Evaluation interval m (steps) 50 / 400; Steps 3,200 / 51,200; # GPUs 8 / 16; Throughput (tokens/s) 15K / 145K (1.3B) or 77K (2.7B). (These values are restated as a config sketch below.)
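
The Pseudocode row above refers to Algorithm 1 (Dynamic Batch Loading), which every m steps re-weights how batches are sampled from the RedPajama domains according to how far each domain's held-out loss still is from a pre-computed reference loss. Below is a minimal Python sketch of that update rule as described in the paper, not the released implementation; the function and argument names are our own.

```python
import numpy as np

def update_domain_weights(weights, val_losses, ref_losses):
    """One dynamic-batch-loading update (sketch of Algorithm 1).

    weights:    current sampling proportions over the data domains
    val_losses: current held-out loss per domain
    ref_losses: pre-computed reference loss per domain
    """
    # Loss disparity: how far each domain still is from its reference
    # loss; domains that already match or beat it contribute zero.
    disparity = np.maximum(np.asarray(val_losses) - np.asarray(ref_losses), 0.0)
    # Exponentially up-weight lagging domains, then renormalize so the
    # proportions sum to one.
    unnormalized = np.asarray(weights) * np.exp(disparity)
    return unnormalized / unnormalized.sum()

# During training this update would run every m steps (m = 50 while
# pruning, m = 400 during continued pre-training, per Table 7), and the
# returned proportions would drive the data loader for the next m steps.
```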
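
The Hardware Specification row reports inference speed measured on a single A100 (80GB) while generating up to 512 tokens for one instance. A simple way to reproduce that kind of measurement with Hugging Face Transformers is sketched below; the checkpoint name and prompt are placeholders, and this is not the authors' benchmarking script.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name; substitute the model you want to time.
model_name = "princeton-nlp/Sheared-LLaMA-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

# Time a single greedy generation of up to 512 new tokens.
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tokens/s)")
```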
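
The Experiment Setup row flattens Table 7, which lists hyperparameters separately for the 0.4B-token pruning stage and the 50B-token continued pre-training stage. For readability, here is a plain Python restatement of those values; the dictionary layout and key names are ours and do not reflect the released configuration files.

```python
# Table 7 hyperparameters restated per stage (key names are our own).
TRAINING_HPARAMS = {
    "pruning": {
        "training_budget_tokens": int(0.4e9),
        "lr_z_phi_lambda": 1.0,        # masks (z) and Lagrange multipliers (phi, lambda)
        "lr_theta": 1e-4,              # model weights
        "lr_warmup_ratio": 0.10,
        "batch_size_tokens": 131_000,
        "eval_interval_m_steps": 50,
        "steps": 3_200,
        "num_gpus": 8,
        "throughput_tokens_per_s": 15_000,
    },
    "continued_pretraining": {
        "training_budget_tokens": int(50e9),
        "lr_theta": 1e-4,
        "lr_warmup_ratio": 0.03,
        "batch_size_tokens": 1_000_000,
        "eval_interval_m_steps": 400,
        "steps": 51_200,
        "num_gpus": 16,
        "throughput_tokens_per_s": {"1.3B": 145_000, "2.7B": 77_000},
    },
}
```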