Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Authors: Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. |
| Researcher Affiliation | Academia | Mengzhou Xia1, Tianyu Gao1, Zhiyuan Zeng2, Danqi Chen1 1Princeton Language and Intelligence, Princeton University 2Department of Computer Science and Technology, Tsinghua University {mengzhou,tianyug,danqic}@cs.princeton.edu zengzy20@mails.tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1: Dynamic Batch Loading (see the code sketch after this table) |
| Open Source Code | Yes | Please find our code and models at https://github.com/princeton-nlp/LLM-Shearing. |
| Open Datasets | Yes | As the training data for LLaMA2 is not publicly accessible, we use RedPajama (Together AI, 2023b), which is a replicated pre-training dataset of the LLaMA1 models (Touvron et al., 2023a), for pruning and continued pre-training. |
| Dataset Splits | Yes | We construct a held-out validation set with 2 million tokens (equivalent to 500 sequences of 4,096 tokens) for each domain. |
| Hardware Specification | Yes | Inference speed is measured on an NVIDIA A100 (80G) GPU, on a single instance generating up to 512 tokens. ... We performed an inference speed analysis comparing LLM-Pruner and Sheared-LLaMA's model architectures using a single A100 GPU to generate up to 2048 tokens. ... Measured with 16 A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions software like 'fully sharded data parallel', 'Flash Attention V1', and 'Composer', but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We present the hyperparameters used in our experiments in Appendix C. ... Table 7: Training hyper-parameters and throughput. Training budget: 0.4B / 50B tokens; Learning rate of z, ϕ, λ: 1.0; Learning rate of θ: 0.0001 / 0.0001; LR warmup ratio: 10% / 3%; Batch size (tokens): 131K / 1M; Evaluation interval m (steps): 50 / 400; Steps: 3,200 / 51,200; # GPUs: 8 / 16; Throughput (tokens/s): 15K / 145K (1.3B) or 77K (2.7B). (See the configuration sketch after this table.) |
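
The pseudocode entry above refers to Algorithm 1 (dynamic batch loading), which upweights RedPajama domains whose validation loss still exceeds a per-domain reference loss, re-evaluated every m steps. Below is a minimal sketch of that idea, not the authors' implementation: the multiplicative exponential update follows the algorithm as described in the paper, but the function name, the example domain losses, and the uniform starting weights are illustrative assumptions.

```python
import numpy as np

def update_domain_weights(weights, val_loss, ref_loss):
    """One dynamic-batch-loading update (sketch of Algorithm 1).

    weights:  current sampling proportions over data domains (sums to 1)
    val_loss: current validation loss per domain
    ref_loss: reference loss per domain (target loss the pruned model should reach)
    """
    # Only domains whose loss still exceeds the reference get up-weighted.
    delta = np.maximum(val_loss - ref_loss, 0.0)
    # Multiplicative exponential update, then renormalize to a distribution.
    new_weights = weights * np.exp(delta)
    return new_weights / new_weights.sum()

# Illustrative usage with hypothetical per-domain losses (numbers are made up).
domains = ["CommonCrawl", "C4", "GitHub", "Books", "Wikipedia", "ArXiv", "StackExchange"]
weights = np.full(len(domains), 1.0 / len(domains))                 # uniform start (assumption)
val_loss = np.array([2.10, 2.30, 1.10, 2.05, 1.80, 1.60, 1.90])
ref_loss = np.array([2.00, 2.20, 1.15, 2.00, 1.75, 1.55, 1.85])
weights = update_domain_weights(weights, val_loss, ref_loss)
print(dict(zip(domains, weights.round(3))))
```

In the paper this update runs inside the training loop every m steps (the "evaluation interval" in Table 7), so domains that lag behind the reference model receive proportionally more of the remaining token budget.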
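
For the experiment-setup entry, the Table 7 values can be restated as a configuration sketch. Grouping the two columns into a 0.4B-token pruning stage and a 50B-token continued pre-training stage is my reading of the table; the key names are illustrative and do not reflect the exact schema of the released Composer/LLM-Shearing configs.

```python
# Table 7 values restated as plain dictionaries. Key names are illustrative;
# the pruning vs. continued pre-training split is an interpretation of the columns.
PRUNING_STAGE = {
    "training_budget_tokens": int(0.4e9),
    "lr_pruning_variables_z_phi_lambda": 1.0,
    "lr_model_parameters_theta": 1e-4,
    "lr_warmup_ratio": 0.10,
    "batch_size_tokens": 131_000,          # reported as 131K
    "eval_interval_m_steps": 50,
    "steps": 3_200,
    "num_gpus": 8,                         # A100 80GB
    "throughput_tokens_per_sec": 15_000,   # reported as 15K
}

CONTINUED_PRETRAINING_STAGE = {
    "training_budget_tokens": int(50e9),
    "lr_model_parameters_theta": 1e-4,
    "lr_warmup_ratio": 0.03,
    "batch_size_tokens": 1_000_000,        # reported as 1M
    "eval_interval_m_steps": 400,
    "steps": 51_200,
    "num_gpus": 16,                        # A100 80GB
    "throughput_tokens_per_sec": {"1.3B": 145_000, "2.7B": 77_000},
}
```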