MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization
Authors: Aozhong Zhang, Naigang Wang, Yanxia Deng, Xin Li, Zi Yang, Penghang Yin
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that MagR achieves state-of-the-art performance on the LLaMA family of models. For example, we achieve a Wikitext2 perplexity of 5.95 on the LLaMA2-70B model for per-channel INT2 weight quantization without incurring any inference overhead. The code is available at https://github.com/AozhongZhang/MagR |
| Researcher Affiliation | Collaboration | Aozhong Zhang¹, Naigang Wang², Yanxia Deng¹, Xin Li¹, Zi Yang¹, Penghang Yin¹ (¹University at Albany, SUNY; ²IBM T. J. Watson Research Center) |
| Pseudocode | Yes | Algorithm 1: Per-channel MagR for one linear layer. Algorithm 2: Column-wise projection onto the unit ℓ1-ball. Algorithm 3: Projection onto the ℓ1-ball. (Hedged sketches of the ℓ1-ball projection and of the Algorithm 1 structure appear after the table.) |
| Open Source Code | Yes | The code is available at https://github.com/AozhongZhang/MagR |
| Open Datasets | Yes | Following the previous work [16, 25, 36], we evaluate the quantized model on language generation tasks on WikiText2 [28] and C4 [33]. |
| Dataset Splits | No | The paper does not explicitly provide specific percentages or sample counts for training, validation, or test splits for the datasets (WikiText2, C4). It mentions using "calibration data" to obtain the input matrix X, but this is not a formal dataset split. |
| Hardware Specification | Yes | We utilized the Hugging Face implementations of the LLaMA1 and LLaMA2 models and performed quantization on a single NVIDIA A100 GPU with 80GB of memory. |
| Software Dependencies | No | For the language generation experiments, our implementation is based on the OPTQ [16] repository, which is built using PyTorch. For executing all zero-shot tasks, we adhere to the lm-eval-harness [17]. (Specific version numbers for PyTorch or lm-eval-harness are not provided.) |
| Experiment Setup | Yes | The choice of parameters. To ensure that the MagR-processed layer output $XW$ is faithful to the original $X\hat{W}$, we need to use a tiny penalty parameter $\alpha$ in (2). For per-channel quantization, $\alpha$ was fixed to be $10^{-3}$ in our experiments... Furthermore, we used a multiplicative scalar $\beta < 1$ to decay the standard quantization step $\delta = (\max(w) - \min(w))/(2^b - 1)$... In addition, the iteration number $K$ in Algorithm 1 was set to 150 across all the experiments. |
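For context on Algorithm 3, projection onto the ℓ1-ball is a standard building block (it appears inside the prox of the ℓ∞ penalty that MagR uses). Below is a minimal PyTorch sketch of the classic sort-and-threshold projection; the function name and signature are ours, not the paper's.

```python
import torch

def project_l1_ball(v: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    """Euclidean projection of a 1-D tensor v onto the l1-ball of the given
    radius, using the standard sort-and-threshold scheme (O(n log n))."""
    if v.abs().sum() <= radius:
        return v.clone()                        # already inside the ball
    u, _ = torch.sort(v.abs(), descending=True)
    cssv = torch.cumsum(u, dim=0) - radius      # cumulative sums shifted by radius
    idx = torch.arange(1, u.numel() + 1, device=v.device)
    rho = idx[u > cssv / idx][-1]               # largest index with positive residual
    theta = cssv[rho - 1] / rho                 # soft-threshold level
    return torch.sign(v) * torch.clamp(v.abs() - theta, min=0)
```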
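And here is a sketch of how the per-channel MagR loop of Algorithm 1 could be structured as proximal gradient descent on the regularized least-squares objective, using the Moreau identity prox_{t‖·‖∞}(v) = v − t·P_{ℓ1-ball}(v/t). Only α = 10⁻³ and K = 150 come from the paper; the step-size choice, variable names, and loop structure are our assumptions, not the authors' released code.

```python
def magr_preprocess(X: torch.Tensor, W_hat: torch.Tensor,
                    alpha: float = 1e-3, K: int = 150) -> torch.Tensor:
    """Sketch of per-channel MagR: find W with smaller column-wise l-inf norm
    whose layer output X @ W stays faithful to the original X @ W_hat.
    Objective (the paper's (2), up to constants):
        min_W  0.5 * ||X W - X W_hat||_F^2 + alpha * sum_j ||w_j||_inf
    solved here by proximal gradient descent (assumed structure)."""
    target = X @ W_hat
    eta = 1.0 / torch.linalg.matrix_norm(X, ord=2) ** 2  # 1/L, L = ||X||_2^2
    t = eta * alpha                                      # prox scaling
    W = W_hat.clone()
    for _ in range(K):
        V = W - eta * (X.T @ (X @ W - target))           # gradient step
        for j in range(V.shape[1]):                      # column-wise prox of t*||.||_inf
            v = V[:, j]
            V[:, j] = v - t * project_l1_ball(v / t)     # Moreau identity
        W = V
    return W
```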
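Finally, the decayed quantization step from the setup quote, as a minimal per-channel min-max quantizer sketch. The paper only states β < 1 and the step formula $\delta = (\max(w) - \min(w))/(2^b - 1)$; the concrete β value, the zero-point convention, and the rounding details below are illustrative assumptions.

```python
def quantize_decayed_step(w: torch.Tensor, bits: int = 2,
                          beta: float = 0.9) -> torch.Tensor:
    """Uniform asymmetric quantization of one weight channel w with a decayed
    step. beta < 1 shrinks the standard min-max step, trading a little
    clipping for finer resolution on the magnitude-reduced weights."""
    delta = beta * (w.max() - w.min()) / (2 ** bits - 1)  # decayed step size
    zero_point = torch.round(-w.min() / delta)
    q = torch.clamp(torch.round(w / delta) + zero_point, 0, 2 ** bits - 1)
    return (q - zero_point) * delta                       # dequantized weights
```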