MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization
Authors: Aozhong Zhang, Naigang Wang, Yanxia Deng, Xin Li, Zi Yang, Penghang Yin
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that MagR achieves state-of-the-art performance on the LLaMA family of models. For example, we achieve a Wikitext2 perplexity of 5.95 on the LLaMA2-70B model for per-channel INT2 weight quantization without incurring any inference overhead. The code is available at https://github.com/AozhongZhang/MagR |
| Researcher Affiliation | Collaboration | Aozhong Zhang¹, Naigang Wang², Yanxia Deng¹, Xin Li¹, Zi Yang¹, Penghang Yin¹ (¹University at Albany, SUNY; ²IBM T. J. Watson Research Center) |
| Pseudocode | Yes | Algorithm 1: Per-channel MagR for one linear layer. Algorithm 2: Column-wise projection onto the unit ℓ1-ball. Algorithm 3: Projection onto the ℓ1-ball. (Hedged sketches of the ℓ1-ball projection and of the Algorithm 1 structure appear after the table.) |
| Open Source Code | Yes | The code is available at https://github.com/AozhongZhang/MagR |
| Open Datasets | Yes | Following the previous work [16, 25, 36], we evaluate the quantized model on language generation tasks on WikiText2 [28] and C4 [33]. |
| Dataset Splits | No | The paper does not explicitly provide specific percentages or sample counts for training, validation, or test splits for the datasets (WikiText2, C4). It mentions using "calibration data" to obtain the input matrix X, but this is not a formal dataset split. |
| Hardware Specification | Yes | We utilized the Hugging Face implementations of the LLaMA1 and LLaMA2 models and performed quantization on a single NVIDIA A100 GPU with 80GB of memory. |
| Software Dependencies | No | For the language generation experiments, our implementation is based on the OPTQ [16] repository, which is built using PyTorch. For executing all zero-shot tasks, we adhere to the lm-eval-harness [17]. (Specific version numbers for PyTorch or lm-eval-harness are not provided.) |
| Experiment Setup | Yes | The choice of parameters. To ensure that the MagR-processed layer output $XW$ is faithful to the original $X\hat{W}$, we need to use a tiny penalty parameter $\alpha$ in (2). For per-channel quantization, $\alpha$ was fixed to be $10^{-3}$ in our experiments... Furthermore, we used a multiplicative scalar $\beta < 1$ to decay the standard quantization step $\delta = (\max(w) - \min(w))/(2^b - 1)$... In addition, the iteration number $K$ in Algorithm 1 was set to 150 across all the experiments. |
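For context on Algorithm 3, projection onto the ℓ1-ball is a standard building block (it appears inside the prox of the ℓ∞ penalty that MagR uses). Below is a minimal PyTorch sketch of the classic sort-and-threshold projection; the function name and signature are ours, not the paper's.

```python
import torch

def project_l1_ball(v: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
    """Euclidean projection of a 1-D tensor v onto the l1-ball of the given
    radius, using the standard sort-and-threshold scheme (O(n log n))."""
    if v.abs().sum() <= radius:
        return v.clone()                        # already inside the ball
    u, _ = torch.sort(v.abs(), descending=True)
    cssv = torch.cumsum(u, dim=0) - radius      # cumulative sums shifted by radius
    idx = torch.arange(1, u.numel() + 1, device=v.device)
    rho = idx[u > cssv / idx][-1]               # largest index with positive residual
    theta = cssv[rho - 1] / rho                 # soft-threshold level
    return torch.sign(v) * torch.clamp(v.abs() - theta, min=0)
```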
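And here is a sketch of how the per-channel MagR loop of Algorithm 1 could be structured as proximal gradient descent on the regularized least-squares objective, using the Moreau identity prox_{t‖·‖∞}(v) = v − t·P_{ℓ1-ball}(v/t). Only α = 10⁻³ and K = 150 come from the paper; the step-size choice, variable names, and loop structure are our assumptions, not the authors' released code.

```python
def magr_preprocess(X: torch.Tensor, W_hat: torch.Tensor,
                    alpha: float = 1e-3, K: int = 150) -> torch.Tensor:
    """Sketch of per-channel MagR: find W with smaller column-wise l-inf norm
    whose layer output X @ W stays faithful to the original X @ W_hat.
    Objective (the paper's (2), up to constants):
        min_W  0.5 * ||X W - X W_hat||_F^2 + alpha * sum_j ||w_j||_inf
    solved here by proximal gradient descent (assumed structure)."""
    target = X @ W_hat
    eta = 1.0 / torch.linalg.matrix_norm(X, ord=2) ** 2  # 1/L, L = ||X||_2^2
    t = eta * alpha                                      # prox scaling
    W = W_hat.clone()
    for _ in range(K):
        V = W - eta * (X.T @ (X @ W - target))           # gradient step
        for j in range(V.shape[1]):                      # column-wise prox of t*||.||_inf
            v = V[:, j]
            V[:, j] = v - t * project_l1_ball(v / t)     # Moreau identity
        W = V
    return W
```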
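Finally, the decayed quantization step from the setup quote, as a minimal per-channel min-max quantizer sketch. The paper only states β < 1 and the step formula $\delta = (\max(w) - \min(w))/(2^b - 1)$; the concrete β value, the zero-point convention, and the rounding details below are illustrative assumptions.

```python
def quantize_decayed_step(w: torch.Tensor, bits: int = 2,
                          beta: float = 0.9) -> torch.Tensor:
    """Uniform asymmetric quantization of one weight channel w with a decayed
    step. beta < 1 shrinks the standard min-max step, trading a little
    clipping for finer resolution on the magnitude-reduced weights."""
    delta = beta * (w.max() - w.min()) / (2 ** bits - 1)  # decayed step size
    zero_point = torch.round(-w.min() / delta)
    q = torch.clamp(torch.round(w / delta) + zero_point, 0, 2 ** bits - 1)
    return (q - zero_point) * delta                       # dequantized weights
```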