ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
Authors: Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan (Celine) Lin
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity reductions of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3- and 2-bit precision, respectively, and more than 80% memory and energy reductions over the original LLMs. |
| Researcher Affiliation | Collaboration | Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan (Celine) Lin; Georgia Institute of Technology, Intel Labs, Google, Google DeepMind |
| Pseudocode | Yes | Algorithm 1 Alternating Multi-bit BCQ [64] |
| Open Source Code | Yes | Codes and models are available at https://github.com/GATECH-EIC/ShiftAddLLM. |
| Open Datasets | Yes | Tasks and Datasets. We evaluate all five LLMs on the commonly adopted language modeling task using the WikiText-2 [41] dataset for perplexity measurement. Additionally, we extend the evaluation of the two largest models, OPT-66B and LLaMA-2-70B, to eight downstream tasks for zero-shot accuracy evaluation. These tasks include ARC (Challenge/Easy) [4], BoolQ [9], Copa [1], PIQA [56], RTE [11], StoryCloze [43], and MMLU [26]. |
| Dataset Splits | No | The paper states that WikiText-2 is used for perplexity measurement and downstream tasks for zero-shot accuracy evaluation, but it does not explicitly provide train/test/validation splits with percentages, sample counts, or citations to specific predefined splits. |
| Hardware Specification | Yes | For efficiency, we measure the latency on a single A100-80GB GPU (PCIe) [45] and estimate the energy costs using an Eyeriss-like hardware accelerator [8, 75]. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We consider five representative SOTA LLM families, including OPT [74], LLaMA-1/2/3 [58, 2], Gemma [42], Mistral [31], and Bloom [49]. Tasks and Datasets. We evaluate all five LLMs on the commonly adopted language modeling task using the WikiText-2 [41] dataset for perplexity measurement. Additionally, we extend the evaluation of the two largest models, OPT-66B and LLaMA-2-70B, to eight downstream tasks for zero-shot accuracy evaluation. Baselines. We consider four SOTA LLM quantization methods: OPTQ [18], LUT-GEMM [48], QuIP [6], and AWQ [38]. Evaluation Metrics. We evaluate ShiftAddLLM and the baselines using both accuracy and efficiency metrics. For accuracy, we evaluate perplexity on the WikiText-2 dataset and zero-shot accuracy on eight downstream tasks. For efficiency, we measure the latency on a single A100-80GB GPU (PCIe) [45] and estimate the energy costs using an Eyeriss-like hardware accelerator [8, 75]. Note that we set the group size of all methods as the length of rows following the setting of OPTQ [18] for a fair comparison. |
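
The pseudocode row above cites Algorithm 1 (Alternating Multi-bit BCQ [64]). Below is a minimal sketch of alternating multi-bit binary-coded quantization for a single weight vector, written in PyTorch; the function name, iteration count, and greedy initialization are illustrative assumptions, and this is not the paper's implementation, which extends the basic routine (e.g., with its multiplication-less reparameterization of the scales).

```python
import torch

def alternating_bcq(w, num_bits=3, iters=15):
    """Approximate w as sum_i alpha_i * b_i with b_i in {-1, +1}^n (binary-coded
    quantization), alternating between a least-squares update of the scales and
    a nearest-level update of the binary codes."""
    w = w.flatten().float()
    n = w.numel()

    # Greedy initialization: peel off one binary component at a time.
    B = torch.empty(n, num_bits)
    r = w.clone()
    for i in range(num_bits):
        b = torch.sign(r)
        b[b == 0] = 1.0
        B[:, i] = b
        r -= r.abs().mean() * b

    # All 2^num_bits sign patterns, used to enumerate the representable levels.
    signs = torch.tensor(
        [[1.0 if (j >> i) & 1 else -1.0 for i in range(num_bits)]
         for j in range(2 ** num_bits)]
    )

    for _ in range(iters):
        # Scales: least-squares fit of w onto the current binary codes.
        alpha = torch.linalg.pinv(B) @ w
        # Codes: snap each weight to its nearest representable level.
        levels = signs @ alpha                               # (2^k,)
        idx = (w.unsqueeze(1) - levels).abs().argmin(dim=1)  # (n,)
        B = signs[idx]                                       # (n, k)

    alpha = torch.linalg.pinv(B) @ w  # final scales consistent with final codes
    return alpha, B                   # reconstruction: w_hat = B @ alpha
```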
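
The WikiText-2 perplexity numbers referenced in the dataset and setup rows follow the standard post-training-quantization evaluation recipe. Below is a minimal sketch using the HuggingFace `transformers`/`datasets` stack; the model name and the 2048-token window are placeholders, and the released ShiftAddLLM code may differ in chunking and tokenizer details.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # stand-in; the paper evaluates much larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Concatenate the raw test split and tokenize it once.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tok("\n\n".join(test["text"]), return_tensors="pt").input_ids

seqlen, nlls = 2048, []
with torch.no_grad():
    for i in range(0, ids.shape[1] - seqlen, seqlen):
        chunk = ids[:, i : i + seqlen]
        # labels=chunk makes the model return the mean next-token NLL for the chunk.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"WikiText-2 perplexity: {ppl.item():.2f}")
```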
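
For the latency figures in the hardware row, GPU kernel timing is typically done with CUDA events around the operation under test. The helper below is a generic sketch (the function name and warmup/iteration counts are assumptions), not the paper's benchmarking harness.

```python
import torch

def measure_latency_ms(fn, warmup=10, iters=100):
    """Average GPU latency of fn() in milliseconds, measured with CUDA events."""
    for _ in range(warmup):      # warm up caches and kernel autotuning
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()     # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters
```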