Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Authors: Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. |
| Researcher Affiliation | Academia | Mengzhou Xia1, Tianyu Gao1, Zhiyuan Zeng2, Danqi Chen1 1Princeton Language and Intelligence, Princeton University 2Department of Computer Science and Technology, Tsinghua University {mengzhou,tianyug,danqic}@cs.princeton.edu zengzy20@mails.tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1: Dynamic Batch Loading (see the code sketch after this table) |
| Open Source Code | Yes | Please find our code and models at https://github.com/princeton-nlp/LLM-Shearing. |
| Open Datasets | Yes | As the training data for LLaMA2 is not publicly accessible, we use RedPajama (Together AI, 2023b), which is a replicated pre-training dataset of the LLaMA1 models (Touvron et al., 2023a), for pruning and continued pre-training. |
| Dataset Splits | Yes | We construct a held-out validation set with 2 million tokens (equivalent to 500 sequences of 4,096 tokens) for each domain. |
| Hardware Specification | Yes | Inference speed is measured on an NVIDIA A100 (80G) GPU, on a single instance generating up to 512 tokens. ... We performed an inference speed analysis comparing LLM-Pruner and Sheared-LLaMA's model architectures using a single A100 GPU to generate up to 2048 tokens. ... Measured with 16 A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions software like 'fully sharded data parallel', 'Flash Attention V1', and 'Composer', but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We present the hyperparameters used in our experiments in Appendix C. ... Table 7: Training hyper-parameters and throughput. Training budget: 0.4B / 50B tokens; Learning rate of z, ϕ, λ: 1.0; Learning rate of θ: 0.0001 / 0.0001; LR warmup ratio: 10% / 3%; Batch size (tokens): 131K / 1M; Evaluation interval m (steps): 50 / 400; Steps: 3,200 / 51,200; # GPUs: 8 / 16; Throughput (tokens/s): 15K / 145K (1.3B) or 77K (2.7B). (See the configuration sketch after this table.) |
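
The pseudocode entry above refers to Algorithm 1 (dynamic batch loading), which upweights RedPajama domains whose validation loss still exceeds a per-domain reference loss, re-evaluated every m steps. Below is a minimal sketch of that idea, not the authors' implementation: the multiplicative exponential update follows the algorithm as described in the paper, but the function name, the example domain losses, and the uniform starting weights are illustrative assumptions.

```python
import numpy as np

def update_domain_weights(weights, val_loss, ref_loss):
    """One dynamic-batch-loading update (sketch of Algorithm 1).

    weights:  current sampling proportions over data domains (sums to 1)
    val_loss: current validation loss per domain
    ref_loss: reference loss per domain (target loss the pruned model should reach)
    """
    # Only domains whose loss still exceeds the reference get up-weighted.
    delta = np.maximum(val_loss - ref_loss, 0.0)
    # Multiplicative exponential update, then renormalize to a distribution.
    new_weights = weights * np.exp(delta)
    return new_weights / new_weights.sum()

# Illustrative usage with hypothetical per-domain losses (numbers are made up).
domains = ["CommonCrawl", "C4", "GitHub", "Books", "Wikipedia", "ArXiv", "StackExchange"]
weights = np.full(len(domains), 1.0 / len(domains))                 # uniform start (assumption)
val_loss = np.array([2.10, 2.30, 1.10, 2.05, 1.80, 1.60, 1.90])
ref_loss = np.array([2.00, 2.20, 1.15, 2.00, 1.75, 1.55, 1.85])
weights = update_domain_weights(weights, val_loss, ref_loss)
print(dict(zip(domains, weights.round(3))))
```

In the paper this update runs inside the training loop every m steps (the "evaluation interval" in Table 7), so domains that lag behind the reference model receive proportionally more of the remaining token budget.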
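
For the experiment-setup entry, the Table 7 values can be restated as a configuration sketch. Grouping the two columns into a 0.4B-token pruning stage and a 50B-token continued pre-training stage is my reading of the table; the key names are illustrative and do not reflect the exact schema of the released Composer/LLM-Shearing configs.

```python
# Table 7 values restated as plain dictionaries. Key names are illustrative;
# the pruning vs. continued pre-training split is an interpretation of the columns.
PRUNING_STAGE = {
    "training_budget_tokens": int(0.4e9),
    "lr_pruning_variables_z_phi_lambda": 1.0,
    "lr_model_parameters_theta": 1e-4,
    "lr_warmup_ratio": 0.10,
    "batch_size_tokens": 131_000,          # reported as 131K
    "eval_interval_m_steps": 50,
    "steps": 3_200,
    "num_gpus": 8,                         # A100 80GB
    "throughput_tokens_per_sec": 15_000,   # reported as 15K
}

CONTINUED_PRETRAINING_STAGE = {
    "training_budget_tokens": int(50e9),
    "lr_model_parameters_theta": 1e-4,
    "lr_warmup_ratio": 0.03,
    "batch_size_tokens": 1_000_000,        # reported as 1M
    "eval_interval_m_steps": 400,
    "steps": 51_200,
    "num_gpus": 16,                        # A100 80GB
    "throughput_tokens_per_sec": {"1.3B": 145_000, "2.7B": 77_000},
}
```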