Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Authors: Junhan Kim, Chungman Lee, Eulrang Cho, Kyungphil Park, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on various language models and complexity analysis, we demonstrate that aespa is accurate and efficient in quantizing Transformer models.
Researcher Affiliation | Industry | Junhan Kim, Chungman Lee, Eulrang Cho, Kyungphil Park, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon, Samsung Research, {jun_one.kim, chungman.lee, dragwon.jeon}@samsung.com
Pseudocode | Yes | In Appendix A, we provide the pseudo-code for the proposed aespa, which was excluded from the main text due to the page limit. Algorithm 1: Quantization
Open Source Code | Yes | The code will be available at https://github.com/SamsungLabs/aespa.
Open Datasets | Yes | When constructing the calibration dataset, we randomly sample 128 segments consisting of 2048 tokens from the C4 dataset [24], as in [7, 13, 3]. (See the calibration-sampling sketch below.)
Dataset Splits | No | The paper mentions a 'calibration dataset' and 'benchmark datasets (e.g., WikiText-2 [22], C4 [24], and PTB [21])' but does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for the evaluation data.
Hardware Specification | Yes | Except for the experiments on the LLaMA2 models, which were performed using an NVIDIA H100 GPU, we conducted all experiments using a single NVIDIA A100 GPU (80 GB).
Software Dependencies | No | The paper mentions using 'Z-FOLD [13]' and 'AdaRound [23]' for implementing aespa, but does not specify software versions for these or any other libraries/environments (e.g., Python, PyTorch versions).
Experiment Setup | Yes | When optimizing a weight-rounding policy, we set the number of iterations, learning rate, and weight of the rounding loss (see λ in (28)) to 2,000, 0.015, and 1.5, respectively. (See the rounding-optimization sketch below.)
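
The calibration setup quoted in the Open Datasets row (128 random segments of 2,048 tokens each from C4) can be reproduced with a short sampling routine. The sketch below is a minimal illustration, not the authors' released code: the `facebook/opt-1.3b` tokenizer and the `sample_c4_calibration` helper are hypothetical choices, and the snippet assumes the Hugging Face `allenai/c4` copy of the dataset.

```python
import random

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical model choice; the paper quantizes several LLM families (e.g., OPT, LLaMA).
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

def sample_c4_calibration(n_samples=128, seq_len=2048, seed=0):
    """Draw n_samples random windows of seq_len tokens from the C4 training split."""
    random.seed(seed)
    stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
    segments = []
    for example in stream:
        ids = tokenizer(example["text"], return_tensors="pt").input_ids
        if ids.shape[1] <= seq_len:
            continue  # skip documents shorter than one full segment
        start = random.randint(0, ids.shape[1] - seq_len - 1)
        segments.append(ids[:, start:start + seq_len])
        if len(segments) == n_samples:
            break
    return torch.cat(segments, dim=0)  # shape: (128, 2048) token ids

calibration_ids = sample_c4_calibration()
```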
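
The Experiment Setup row reports the rounding-policy hyperparameters (2,000 iterations, learning rate 0.015, rounding-loss weight λ = 1.5). Below is a minimal AdaRound-style sketch that plugs in those values; it is not aespa's actual objective (the paper's Eq. (28) reformulates the loss), and the `optimize_rounding` helper, the fixed regularizer sharpness `beta`, and the plain layer-output reconstruction target are assumptions for illustration.

```python
import torch

def rectified_sigmoid(v, zeta=1.1, gamma=-0.1):
    """Stretched sigmoid from AdaRound; relaxes the binary rounding decision to [0, 1]."""
    return torch.clamp(torch.sigmoid(v) * (zeta - gamma) + gamma, 0.0, 1.0)

def optimize_rounding(W, X, scale, n_iters=2000, lr=0.015, lam=1.5, beta=20.0):
    """Learn an up/down rounding policy that minimizes layer-output reconstruction error.

    W: (out_features, in_features) full-precision weights
    X: (n_tokens, in_features) calibration activations for this layer
    scale: quantization step size (scalar or tensor broadcastable to W)
    """
    W_floor = torch.floor(W / scale)
    target = X @ W.T                                   # full-precision layer output
    # AdaRound initializes v from the fractional residual; zeros are kept here for brevity.
    v = torch.zeros_like(W, requires_grad=True)        # continuous rounding variable
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(n_iters):
        h = rectified_sigmoid(v)
        W_soft = (W_floor + h) * scale                 # soft-quantized weights
        recon = ((X @ W_soft.T - target) ** 2).mean()  # reconstruction loss
        f_reg = (1.0 - (2.0 * h - 1.0).abs().pow(beta)).sum()  # push h toward {0, 1}
        loss = recon + lam * f_reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    h_hard = (rectified_sigmoid(v) > 0.5).float()      # final binary rounding decision
    return (W_floor + h_hard) * scale

```

In the original AdaRound recipe, `beta` is also annealed over the course of optimization; it is held fixed above to keep the sketch short.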