Learn To be Efficient: Build Structured Sparsity in Large Language Models
Authors: Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Zhuoqing Morley Mao, Beidi Chen, Fan Lai, Atul Prakash
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluation on language understanding, language generation, and instruction tuning tasks shows that LTE consistently outperforms SOTA baselines. Along with our hardware-aware custom kernel implementation, LTE reduces LLaMA2-7B inference latency by 25% at 50% sparsity. |
| Researcher Affiliation | Academia | University of Michigan, Carnegie Mellon University, University of Illinois Urbana-Champaign |
| Pseudocode | No | The paper describes the methodology using text and mathematical formulas but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make our code publicly available at GitHub: https://github.com/haizhongzheng/LTE |
| Open Datasets | Yes | Datasets: 1) Natural Language Understanding (NLU): we evaluate on eight tasks from the GLUE benchmark [40]: SST-2 [34], RTE [5], CoLA [42], MNLI [43], QNLI [29], QQP [13], STS-B [3], and MRPC [6]. 2) Natural Language Generation (NLG): we evaluate LTE on E2E [27], XSum [26], and WikiText-103 [24]. 3) Instruction Tuning: besides downstream tasks, we also evaluate LTE on instruction tuning to assess its generalization capabilities. We use the Tulu dataset [41] to perform instruction tuning with LTE, and we evaluate models with the MMLU benchmark [11]. |
| Dataset Splits | No | The paper uses standard benchmark datasets like GLUE, XSum, E2E, Wikitext103, Tulu, and MMLU, which often come with predefined splits. However, the paper does not explicitly state the specific training, validation, and test split percentages or sample counts used for these datasets, nor does it explicitly state that standard splits were used for reproduction purposes. |
| Hardware Specification | Yes | All models are trained and evaluated on A100, A40, and 3090Ti GPUs, depending on memory usage and availability. We evaluate wall-clock time speedup using LLaMA2-7B on a single 3090Ti. |
| Software Dependencies | Yes | In this paper, we use Triton 2.3.0 to implement a customized MLP layer to translate the sparsity to wall-clock time latency reduction. |
| Experiment Setup | Yes | We present all training hyperparameters in Tables 3, 4, 5, and 6. For hyperparameters presented in {}, we select the best hyperparameter for each task. We follow the settings in the RoBERTa paper [20] to fine-tune RoBERTa on the GLUE datasets. We set the coefficient for the separability loss (λ in Equation 6) to 0.5 for all stage-1 training. We use different η to control the sparsity of trained models (Figure 10). |
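The Software Dependencies and Experiment Setup rows quote a Triton-based custom MLP kernel, a separability-loss coefficient λ = 0.5, and a threshold η that controls sparsity. The sketch below is a minimal, hypothetical PyTorch illustration of that general recipe: an FFN whose neurons are gated by learned scores, hard-thresholded at η for structured sparsity, and trained with a separability-style regularizer weighted by λ. The class and function names, the sigmoid gating, and the exact regularizer form are assumptions for illustration; they are not the paper's Equation 6 or its Triton kernel.

```python
# Hypothetical sketch of threshold-gated FFN sparsity with a separability-style
# regularizer. All names (GatedFFN, separability_loss, eta) and the loss form
# are illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFFN(nn.Module):
    """FFN whose intermediate neurons are masked by learned gate scores.

    At inference, neurons whose gate score falls below the threshold `eta`
    are skipped; this is the kind of structured sparsity a hardware-aware
    kernel (e.g. a custom Triton MLP) could turn into wall-clock speedups.
    """

    def __init__(self, d_model: int, d_ffn: int, eta: float = 0.5):
        super().__init__()
        self.up = nn.Linear(d_model, d_ffn)
        self.down = nn.Linear(d_ffn, d_model)
        self.gate = nn.Linear(d_model, d_ffn)  # predicts per-neuron scores
        self.eta = eta  # sparsity threshold (analogous to the paper's eta)

    def forward(self, x: torch.Tensor):
        scores = torch.sigmoid(self.gate(x))         # (batch, d_ffn), in [0, 1]
        if self.training:
            mask = scores                            # soft gating keeps gradients
        else:
            mask = (scores > self.eta).to(x.dtype)   # hard structured sparsity
        hidden = F.gelu(self.up(x)) * mask
        return self.down(hidden), scores


def separability_loss(scores: torch.Tensor) -> torch.Tensor:
    """Push gate scores away from the threshold region toward 0 or 1, so a
    single cut at eta cleanly separates active from inactive neurons.
    Illustrative stand-in only, not the paper's Equation 6."""
    return (scores * (1.0 - scores)).mean()


if __name__ == "__main__":
    ffn = GatedFFN(d_model=64, d_ffn=256, eta=0.5)
    x = torch.randn(8, 64)
    out, scores = ffn(x)
    task_loss = out.pow(2).mean()        # placeholder task loss
    lam = 0.5                            # coefficient quoted in the report
    loss = task_loss + lam * separability_loss(scores)
    loss.backward()
    sparsity = (scores < 0.5).float().mean().item()
    print(f"fraction of neurons below eta=0.5: {sparsity:.2f}")
```

In this sketch the regularizer only encourages gate scores to polarize; the actual sparsity level would be controlled by how η is chosen (and, per the quoted setup, the paper sweeps different η values to trade accuracy for sparsity).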