ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Authors: Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan (Celine) Lin

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity reductions of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3- and 2-bit precision, respectively, and more than 80% memory and energy reductions over the original LLMs.
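To make the "multiplication-less reparameterization" in the title concrete, below is a minimal illustrative sketch (not the authors' kernel or released code) of why a binary weight matrix paired with a power-of-two scaling factor lets a weight-activation product be computed with signed adds and a bit shift instead of multiplications. The integer activations, shapes, and NumPy formulation are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the paper's kernel): with a binary weight
# matrix B in {-1, +1} and a power-of-two scale 2**s, the product
# x @ (2**s * B) can be computed with additions/subtractions followed by a
# bit shift, avoiding multiplications entirely.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=(1, 16))              # assumed integer activations
B = rng.choice([-1, 1], size=(16, 4))                   # binary weight matrix
s = 3                                                   # power-of-two scale exponent

# Multiplication-based reference: x @ (2**s * B)
ref = x @ (B * (1 << s))

# Multiplication-less version: signed adds for the binary matrix, then a shift
acc = np.where(B.T[None, :, :] > 0, x[:, None, :], -x[:, None, :]).sum(axis=-1)
out = acc << s                                          # apply the scale as a bit shift

assert np.array_equal(ref, out)
```

In ShiftAddLLM the binary matrices and scales come from the BCQ reparameterization referenced in the pseudocode row below; this sketch only demonstrates the arithmetic substitution, not the quantization procedure itself.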
Researcher Affiliation Collaboration Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan (Celine) Lin; Georgia Institute of Technology, Intel Labs, Google, Google DeepMind
Pseudocode Yes Algorithm 1 Alternating Multi-bit BCQ [64]
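For readers reproducing the quantizer, here is a hedged NumPy sketch of the alternating multi-bit BCQ idea referenced above: approximate a weight vector as w ≈ Σ_i α_i b_i with b_i ∈ {−1, +1}, alternating between least-squares fitting of the scales and greedy re-selection of the binary codes. The function name, initialization, and iteration count are illustrative and are not taken from the paper's Algorithm 1.

```python
# Minimal sketch of alternating multi-bit BCQ for a single weight vector.
# Approximates w ≈ sum_i alpha_i * b_i with b_i in {-1, +1}; alternates between
# refitting the scales (least squares) and re-picking the binary codes.
# Purely illustrative; not the paper's Algorithm 1.
import numpy as np

def alternating_bcq(w, num_bits=3, num_iters=15):
    n = w.shape[0]
    B = np.empty((num_bits, n))
    alpha = np.empty(num_bits)

    # Greedy initialization: sign / mean-abs of the running residual
    r = w.copy()
    for i in range(num_bits):
        B[i] = np.where(r >= 0, 1.0, -1.0)
        alpha[i] = np.abs(r).mean()
        r -= alpha[i] * B[i]

    for _ in range(num_iters):
        # Fix binary codes, solve the scales by least squares: B^T alpha ≈ w
        alpha, *_ = np.linalg.lstsq(B.T, w, rcond=None)
        # Fix scales, re-pick codes greedily: choose b_i minimizing |r - alpha_i * b_i|
        r = w.copy()
        for i in np.argsort(-np.abs(alpha)):        # largest scales first
            B[i] = np.where(alpha[i] * r >= 0, 1.0, -1.0)
            r -= alpha[i] * B[i]
    return alpha, B

w = np.random.randn(64)
alpha, B = alternating_bcq(w)
print("reconstruction error:", np.linalg.norm(w - alpha @ B))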
Open Source Code Yes Codes and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.
Open Datasets Yes Tasks and Datasets. We evaluate all five LLMs on the commonly adopted language modeling task using the WikiText-2 [41] dataset for perplexity measurement. Additionally, we extend the evaluation of the two largest models, OPT-66B and LLaMA-2-70B, to eight downstream tasks for zero-shot accuracy evaluation. These tasks include ARC (Challenge/Easy) [4], BoolQ [9], Copa [1], PIQA [56], RTE [11], StoryCloze [43], and MMLU [26].
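The WikiText-2 perplexity measurement this row refers to can be reproduced with a standard sliding-window evaluation; the sketch below uses Hugging Face `transformers` and `datasets`, with a placeholder checkpoint, a 2048-token window, and non-overlapping strides as assumptions rather than the paper's exact evaluation script.

```python
# Hedged sketch of standard WikiText-2 perplexity evaluation (not the paper's script).
# Model name, context length, and striding are illustrative placeholders.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"                           # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tok("\n\n".join(test["text"]), return_tensors="pt").input_ids

seq_len, nlls = 2048, []
for start in range(0, ids.shape[1] - seq_len, seq_len):
    chunk = ids[:, start:start + seq_len]
    with torch.no_grad():
        # labels=chunk makes the model return the mean next-token cross-entropy
        nlls.append(model(chunk, labels=chunk).loss * seq_len)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seq_len))
print(f"WikiText-2 perplexity: {ppl.item():.2f}")
```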
Dataset Splits No The paper states that it uses WikiText-2 for perplexity measurement and downstream tasks for zero-shot accuracy evaluation, but it does not explicitly provide train/test/validation dataset splits with percentages, sample counts, or citations to specific predefined splits.
Hardware Specification Yes For efficiency, we measure the latency on a single A100-80GB GPU (PCIe) [45] and estimate the energy costs using an Eyeriss-like hardware accelerator [8, 75].
Software Dependencies No The paper does not provide specific software dependencies with version numbers.
Experiment Setup Yes We consider five representative SOTA LLM families, including OPT [74], LLaMA-1/2/3 [58, 2], Gemma [42], Mistral [31], and Bloom [49]. Tasks and Datasets. We evaluate all five LLMs on the commonly adopted language modeling task using the WikiText-2 [41] dataset for perplexity measurement. Additionally, we extend the evaluation of the two largest models, OPT-66B and LLaMA-2-70B, to eight downstream tasks for zero-shot accuracy evaluation. Baselines. We consider four SOTA LLM quantization methods: OPTQ [18], LUT-GEMM [48], QuIP [6], and AWQ [38]. Evaluation Metrics. We evaluate ShiftAddLLM and the baselines using both accuracy and efficiency metrics. For accuracy, we evaluate perplexity on the WikiText-2 dataset and zero-shot accuracy on eight downstream tasks. For efficiency, we measure the latency on a single A100-80GB GPU (PCIe) [45] and estimate the energy costs using an Eyeriss-like hardware accelerator [8, 75]. Note that we set the group size of all methods as the length of rows, following the setting of OPTQ [18], for a fair comparison.
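For the downstream zero-shot accuracy numbers, one common reproduction route is EleutherAI's lm-evaluation-harness. The snippet below is a sketch assuming the v0.4+ `simple_evaluate` API, a placeholder checkpoint, and a subset of task identifiers; the exact task names, versions, and settings used in the paper may differ.

```python
# Hedged sketch of zero-shot accuracy evaluation with lm-evaluation-harness
# (v0.4+ API assumed). Checkpoint and task identifiers are illustrative and may
# not match the paper's exact configuration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=facebook/opt-125m",   # placeholder checkpoint
    tasks=["arc_easy", "arc_challenge", "boolq", "piqa", "copa", "rte"],
    num_fewshot=0,                               # zero-shot setting
)

for task, metrics in results["results"].items():
    print(task, metrics)
```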