ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Authors: Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan (Celine) Lin

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity reductions of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3- and 2-bit precision, respectively, and more than 80% memory and energy reductions over the original LLMs.
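To make the "multiplication-less reparameterization" in the title concrete, below is a minimal illustrative sketch (not the authors' kernel or released code) of why a binary weight matrix paired with a power-of-two scaling factor lets a weight-activation product be computed with signed adds and a bit shift instead of multiplications. The integer activations, shapes, and NumPy formulation are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the paper's kernel): with a binary weight
# matrix B in {-1, +1} and a power-of-two scale 2**s, the product
# x @ (2**s * B) can be computed with additions/subtractions followed by a
# bit shift, avoiding multiplications entirely.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=(1, 16))              # assumed integer activations
B = rng.choice([-1, 1], size=(16, 4))                   # binary weight matrix
s = 3                                                   # power-of-two scale exponent

# Multiplication-based reference: x @ (2**s * B)
ref = x @ (B * (1 << s))

# Multiplication-less version: signed adds for the binary matrix, then a shift
acc = np.where(B.T[None, :, :] > 0, x[:, None, :], -x[:, None, :]).sum(axis=-1)
out = acc << s                                          # apply the scale as a bit shift

assert np.array_equal(ref, out)
```

In ShiftAddLLM the binary matrices and scales come from the BCQ reparameterization referenced in the pseudocode row below; this sketch only demonstrates the arithmetic substitution, not the quantization procedure itself.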
Researcher Affiliation Collaboration Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan (Celine) Lin; Georgia Institute of Technology, Intel Labs, Google, Google DeepMind
Pseudocode Yes Algorithm 1 Alternating Multi-bit BCQ [64]
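For readers reproducing the quantizer, here is a hedged NumPy sketch of the alternating multi-bit BCQ idea referenced above: approximate a weight vector as w ≈ Σ_i α_i b_i with b_i ∈ {−1, +1}, alternating between least-squares fitting of the scales and greedy re-selection of the binary codes. The function name, initialization, and iteration count are illustrative and are not taken from the paper's Algorithm 1.

```python
# Minimal sketch of alternating multi-bit BCQ for a single weight vector.
# Approximates w ≈ sum_i alpha_i * b_i with b_i in {-1, +1}; alternates between
# refitting the scales (least squares) and re-picking the binary codes.
# Purely illustrative; not the paper's Algorithm 1.
import numpy as np

def alternating_bcq(w, num_bits=3, num_iters=15):
    n = w.shape[0]
    B = np.empty((num_bits, n))
    alpha = np.empty(num_bits)

    # Greedy initialization: sign / mean-abs of the running residual
    r = w.copy()
    for i in range(num_bits):
        B[i] = np.where(r >= 0, 1.0, -1.0)
        alpha[i] = np.abs(r).mean()
        r -= alpha[i] * B[i]

    for _ in range(num_iters):
        # Fix binary codes, solve the scales by least squares: B^T alpha ≈ w
        alpha, *_ = np.linalg.lstsq(B.T, w, rcond=None)
        # Fix scales, re-pick codes greedily: choose b_i minimizing |r - alpha_i * b_i|
        r = w.copy()
        for i in np.argsort(-np.abs(alpha)):        # largest scales first
            B[i] = np.where(alpha[i] * r >= 0, 1.0, -1.0)
            r -= alpha[i] * B[i]
    return alpha, B

w = np.random.randn(64)
alpha, B = alternating_bcq(w)
print("reconstruction error:", np.linalg.norm(w - alpha @ B))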
Open Source Code Yes Codes and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.
Open Datasets Yes Tasks and Datasets. We evaluate all five LLMs on the commonly adopted language modeling task using the WikiText-2 [41] dataset for perplexity measurement. Additionally, we extend the evaluation of the two largest models, OPT-66B and LLaMA-2-70B, to eight downstream tasks for zero-shot accuracy evaluation. These tasks include ARC (Challenge/Easy) [4], BoolQ [9], Copa [1], PIQA [56], RTE [11], StoryCloze [43], and MMLU [26].
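The WikiText-2 perplexity measurement this row refers to can be reproduced with a standard sliding-window evaluation; the sketch below uses Hugging Face `transformers` and `datasets`, with a placeholder checkpoint, a 2048-token window, and non-overlapping strides as assumptions rather than the paper's exact evaluation script.

```python
# Hedged sketch of standard WikiText-2 perplexity evaluation (not the paper's script).
# Model name, context length, and striding are illustrative placeholders.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"                           # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tok("\n\n".join(test["text"]), return_tensors="pt").input_ids

seq_len, nlls = 2048, []
for start in range(0, ids.shape[1] - seq_len, seq_len):
    chunk = ids[:, start:start + seq_len]
    with torch.no_grad():
        # labels=chunk makes the model return the mean next-token cross-entropy
        nlls.append(model(chunk, labels=chunk).loss * seq_len)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seq_len))
print(f"WikiText-2 perplexity: {ppl.item():.2f}")
```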
Dataset Splits No The paper states that it uses WikiText-2 for perplexity measurement and downstream tasks for zero-shot accuracy evaluation, but it does not explicitly provide train/test/validation dataset splits with percentages, sample counts, or citations to specific predefined splits.
Hardware Specification Yes For efficiency, we measure the latency on a single A100-80GB GPU (PCIe) [45] and estimate the energy costs using an Eyeriss-like hardware accelerator [8, 75].
Software Dependencies No The paper does not provide specific software dependencies with version numbers.
Experiment Setup Yes We consider five representative SOTA LLM families, including OPT [74], LLaMA-1/2/3 [58, 2], Gemma [42], Mistral [31], and Bloom [49]. Tasks and Datasets. We evaluate all five LLMs on the commonly adopted language modeling task using the WikiText-2 [41] dataset for perplexity measurement. Additionally, we extend the evaluation of the two largest models, OPT-66B and LLaMA-2-70B, to eight downstream tasks for zero-shot accuracy evaluation. Baselines. We consider four SOTA LLM quantization methods: OPTQ [18], LUT-GEMM [48], QuIP [6], and AWQ [38]. Evaluation Metrics. We evaluate ShiftAddLLM and the baselines using both accuracy and efficiency metrics. For accuracy, we evaluate perplexity on the WikiText-2 dataset and zero-shot accuracy on eight downstream tasks. For efficiency, we measure the latency on a single A100-80GB GPU (PCIe) [45] and estimate the energy costs using an Eyeriss-like hardware accelerator [8, 75]. Note that we set the group size of all methods as the length of rows, following the setting of OPTQ [18], for a fair comparison.
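For the downstream zero-shot accuracy numbers, one common reproduction route is EleutherAI's lm-evaluation-harness. The snippet below is a sketch assuming the v0.4+ `simple_evaluate` API, a placeholder checkpoint, and a subset of task identifiers; the exact task names, versions, and settings used in the paper may differ.

```python
# Hedged sketch of zero-shot accuracy evaluation with lm-evaluation-harness
# (v0.4+ API assumed). Checkpoint and task identifiers are illustrative and may
# not match the paper's exact configuration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=facebook/opt-125m",   # placeholder checkpoint
    tasks=["arc_easy", "arc_challenge", "boolq", "piqa", "copa", "rte"],
    num_fewshot=0,                               # zero-shot setting
)

for task, metrics in results["results"].items():
    print(task, metrics)
```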