Learn To be Efficient: Build Structured Sparsity in Large Language Models
Authors: Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Zhuoqing Morley Mao, Beidi Chen, Fan Lai, Atul Prakash
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation on language understanding, language generation, and instruction tuning tasks shows that LTE consistently outperforms SOTA baselines. Along with our hardware-aware custom kernel implementation, LTE reduces LLaMA2-7B inference latency by 25% at 50% sparsity. |
| Researcher Affiliation | Academia | University of Michigan, Carnegie Mellon University, University of Illinois Urbana-Champaign |
| Pseudocode | No | The paper describes the methodology using text and mathematical formulas but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make our code publicly available at GitHub: https://github.com/haizhongzheng/LTE |
| Open Datasets | Yes | Datasets: 1) Natural Language Understanding (NLU): we evaluate on eight tasks from the GLUE benchmark [40]: SST-2 [34], RTE [5], CoLA [42], MNLI [43], QNLI [29], QQP [13], STS-B [3], and MRPC [6]. 2) Natural Language Generation (NLG): we evaluate LTE on E2E [27], XSum [26], and WikiText-103 [24]. 3) Instruction Tuning: besides downstream tasks, we also evaluate LTE on instruction tuning to assess its generalization capabilities. We use the Tulu dataset [41] to perform instruction tuning with LTE, and we evaluate models with the MMLU benchmark [11]. |
| Dataset Splits | No | The paper uses standard benchmark datasets like GLUE, XSum, E2E, Wikitext103, Tulu, and MMLU, which often come with predefined splits. However, the paper does not explicitly state the specific training, validation, and test split percentages or sample counts used for these datasets, nor does it explicitly state that standard splits were used for reproduction purposes. |
| Hardware Specification | Yes | All models are trained and evaluated on A100, A40, and 3090Ti GPUs, depending on memory usage and availability. We evaluate wall-clock time speedup using LLaMA2-7B on a single 3090Ti. |
| Software Dependencies | Yes | In this paper, we use Triton 2.3.0 to implement a customized MLP layer to translate the sparsity to wall-clock time latency reduction. |
| Experiment Setup | Yes | We present all training hyperparameters in Tables 3, 4, 5, and 6. For hyperparameters presented in {}, we select the best hyperparameter for each task. We follow the settings in the RoBERTa paper [20] to fine-tune RoBERTa on the GLUE datasets. We set the coefficient for the separability loss (λ in Equation 6) to 0.5 for all stage-1 training. We use different η to control the sparsity of trained models (Figure 10). |
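The Software Dependencies and Experiment Setup rows quote a Triton-based custom MLP kernel, a separability-loss coefficient λ = 0.5, and a threshold η that controls sparsity. The sketch below is a minimal, hypothetical PyTorch illustration of that general recipe: an FFN whose neurons are gated by learned scores, hard-thresholded at η for structured sparsity, and trained with a separability-style regularizer weighted by λ. The class and function names, the sigmoid gating, and the exact regularizer form are assumptions for illustration; they are not the paper's Equation 6 or its Triton kernel.

```python
# Hypothetical sketch of threshold-gated FFN sparsity with a separability-style
# regularizer. All names (GatedFFN, separability_loss, eta) and the loss form
# are illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFFN(nn.Module):
    """FFN whose intermediate neurons are masked by learned gate scores.

    At inference, neurons whose gate score falls below the threshold `eta`
    are skipped; this is the kind of structured sparsity a hardware-aware
    kernel (e.g. a custom Triton MLP) could turn into wall-clock speedups.
    """

    def __init__(self, d_model: int, d_ffn: int, eta: float = 0.5):
        super().__init__()
        self.up = nn.Linear(d_model, d_ffn)
        self.down = nn.Linear(d_ffn, d_model)
        self.gate = nn.Linear(d_model, d_ffn)  # predicts per-neuron scores
        self.eta = eta  # sparsity threshold (analogous to the paper's eta)

    def forward(self, x: torch.Tensor):
        scores = torch.sigmoid(self.gate(x))         # (batch, d_ffn), in [0, 1]
        if self.training:
            mask = scores                            # soft gating keeps gradients
        else:
            mask = (scores > self.eta).to(x.dtype)   # hard structured sparsity
        hidden = F.gelu(self.up(x)) * mask
        return self.down(hidden), scores


def separability_loss(scores: torch.Tensor) -> torch.Tensor:
    """Push gate scores away from the threshold region toward 0 or 1, so a
    single cut at eta cleanly separates active from inactive neurons.
    Illustrative stand-in only, not the paper's Equation 6."""
    return (scores * (1.0 - scores)).mean()


if __name__ == "__main__":
    ffn = GatedFFN(d_model=64, d_ffn=256, eta=0.5)
    x = torch.randn(8, 64)
    out, scores = ffn(x)
    task_loss = out.pow(2).mean()        # placeholder task loss
    lam = 0.5                            # coefficient quoted in the report
    loss = task_loss + lam * separability_loss(scores)
    loss.backward()
    sparsity = (scores < 0.5).float().mean().item()
    print(f"fraction of neurons below eta=0.5: {sparsity:.2f}")
```

In this sketch the regularizer only encourages gate scores to polarize; the actual sparsity level would be controlled by how η is chosen (and, per the quoted setup, the paper sweeps different η values to trade accuracy for sparsity).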