IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers

Authors: Zhanpeng Zeng, Karthikeyan Sankaralingam, Vikas Singh

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we first verify that when the low bit-width restriction is removed, for a variety of Transformer-based models, integers are, in fact, sufficient for all GEMMs needed for both training and inference stages, and achieve parity (with floating point). No sophisticated techniques are needed. We find that while a large majority of entries in matrices (encountered in such models) can be easily represented by low bit-width integers, the existence of a few heavy hitter entries makes it difficult to achieve efficiency gains via the exclusive use of low bit-width GEMMs alone. To address this issue, we develop a simple algorithm, Integer Matrix Unpacking (IM-Unpack), to unpack a matrix with large integer entries into a larger matrix whose entries all lie within the representable range of arbitrarily low bit-width integers. This allows equivalence with the original GEMM, i.e., the exact result can be obtained using purely low bit-width integer GEMMs. This comes at the cost of additional operations; we show that for many popular models, this overhead is quite small. Code is available at https://github.com/vsingh-group/im-unpack. (A minimal sketch of this unpack-and-recombine idea follows the table.)
Researcher Affiliation | Collaboration | Zhanpeng Zeng (1), Karthikeyan Sankaralingam (1, 2), Vikas Singh (1). (1) University of Wisconsin-Madison, (2) NVIDIA Research. Correspondence to: Zhanpeng Zeng <zzeng38@wisc.edu>.
Pseudocode | Yes | Algorithm 1: Unpack Row(A, b); Algorithm 2: Unpack Column(A, B, S, b); Algorithm 3: Scaled Mat Mul(A, B, S); Algorithm 4: Unpack Both(A, B, S, b); Algorithm 5: Unpack(A, B, S, b, strategy). (A simplified row-unpacking sketch follows the table.)
Open Source Code | Yes | Code is available at https://github.com/vsingh-group/im-unpack.
Open Datasets | Yes | To limit the amount of compute but still gather useful feedback, we evaluate RTN on RoBERTa (Liu et al., 2019) pretraining using masked language modeling (Devlin et al., 2019) on the English Wikipedia corpus (Foundation) and ImageNet classification (Deng et al., 2009) using ViT (Dosovitskiy et al., 2021) (and see T5-Large (Raffel et al., 2020) finetuning in A.3).
Dataset Splits | No | The paper mentions 'validation log perplexity' and 'validation top-1 accuracy' but does not explicitly state the dataset splits (e.g., 80/10/10 split) used for training, validation, and testing.
Hardware Specification | Yes | We run all of our experiments on NVIDIA RTX 3090.
Software Dependencies | No | The paper mentions software like 'timm', 'PyTorch', and 'CUTLASS' but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | All hyperparameters (including random seed) are the same for full-precision and RTN quantized training. See A.3 for more details of training configurations. ... We use an AdamW optimizer with 1e-4 learning rate, 10,000 warm-up steps, 0.01 weight decay, and linear decay. ... The hyperparameters of all experiments are the same: batch size 1024, optimizer AdamW, learning rate 0.001, weight decay 0.05, augmentation rand-m9-mstd0.5-inc1, mixup 0.8, cutmix 1.0. (A hedged optimizer and scheduler sketch follows the table.)
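
To make the equivalence claim quoted in the Research Type row concrete, below is a minimal NumPy sketch, not the paper's implementation, of the unpack-and-recombine idea: an integer matrix is split into base-2^b "digit" matrices whose entries have magnitude below 2^b, and the exact product A @ B is then recovered as a scaled sum of low bit-width GEMMs. The function names and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def unpack_digits(A, b=4):
    """Split an integer matrix A into base-2**b 'digit' matrices D_k plus
    power-of-two scales so that A == sum_k scales[k] * D_k.
    Each D_k has entries of magnitude below 2**b (signs carried per entry)."""
    sign = np.sign(A)
    mag = np.abs(A)
    digits, scales = [], []
    k = 0
    while mag.any() or k == 0:
        digits.append((mag % (1 << b)) * sign)  # low-magnitude chunk, sign restored
        scales.append(1 << (b * k))
        mag = mag >> b                          # carry the remaining high bits forward
        k += 1
    return digits, scales

def gemm_via_low_bit(A, B, b=4):
    """Exact A @ B using only GEMMs whose left operand is low bit-width."""
    digits, scales = unpack_digits(A, b)
    return sum(s * (D @ B) for D, s in zip(digits, scales))

rng = np.random.default_rng(0)
A = rng.integers(-1000, 1001, size=(8, 16))
B = rng.integers(-7, 8, size=(16, 4))
assert np.array_equal(gemm_via_low_bit(A, B), A @ B)
```

Shrinking b only increases the number of digit matrices (and hence GEMM calls) while keeping the result exact, which mirrors the overhead trade-off described in the abstract.
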
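The Pseudocode row lists routines such as Unpack Row and Scaled Mat Mul. The sketch below is a simplified reconstruction of that row-unpacking pattern, not the paper's exact algorithms: rows whose entries fall outside the signed b-bit range are split into additional rows, each tagged with its original row index and a power-of-two scale, and the product is recovered by one low bit-width GEMM followed by a scaled scatter-add. The bookkeeping format (index and scale arrays) is an assumption for illustration.

```python
import numpy as np

def unpack_rows(A, b=4):
    """Split rows of A whose entries exceed the signed b-bit range into extra
    rows; return the unpacked matrix U plus, for each row of U, the original
    row index it belongs to and its power-of-two scale."""
    lim = 1 << (b - 1)                            # signed b-bit entries lie in [-lim, lim)
    rows, idx, scale = [], [], []
    for i, row in enumerate(A):
        r, s = row.copy(), 1
        while True:
            low = ((r + lim) % (2 * lim)) - lim   # centered remainder in [-lim, lim)
            rows.append(low)
            idx.append(i)
            scale.append(s)
            r = (r - low) // (2 * lim)            # carry what is left to the next chunk
            s *= 2 * lim
            if not r.any():
                break
    return np.array(rows), np.array(idx), np.array(scale)

def scaled_matmul(U, idx, scale, B, n_rows):
    """Recombine the unpacked product: out[idx[k]] += scale[k] * (U @ B)[k]."""
    P = U @ B                                     # left operand entries fit in b signed bits
    out = np.zeros((n_rows, B.shape[1]), dtype=P.dtype)
    np.add.at(out, idx, scale[:, None] * P)
    return out

rng = np.random.default_rng(1)
A = rng.integers(-500, 500, size=(6, 12))
B = rng.integers(-7, 8, size=(12, 3))
U, idx, scale = unpack_rows(A, b=4)
assert np.array_equal(scaled_matmul(U, idx, scale, B, A.shape[0]), A @ B)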
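```

In this simplified version, a row that already fits in b bits contributes exactly one row to the unpacked matrix, so the extra GEMM work grows only with the number of rows containing out-of-range entries, consistent with the paper's claim that the overhead is small for many popular models. The paper's Unpack Column and Unpack Both routines apply the analogous treatment to the right operand when B also contains out-of-range entries.

For the Experiment Setup row, here is a hedged PyTorch sketch of the quoted RoBERTa pretraining optimizer configuration (AdamW, learning rate 1e-4, weight decay 0.01, 10,000 warm-up steps, then linear decay). The model and the total number of training steps are placeholders, not values reported in the paper.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Placeholder model and total step count; only the quoted hyperparameters
# (AdamW, lr 1e-4, weight decay 0.01, 10,000 warm-up steps, linear decay)
# come from the paper.
model = torch.nn.Linear(768, 768)
warmup_steps, total_steps = 10_000, 250_000  # total_steps is an assumed placeholder

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

def lr_lambda(step):
    # Linear warm-up for the first `warmup_steps`, then linear decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# During training, call optimizer.step() followed by scheduler.step() each step.
```

Note that the ViT/ImageNet runs quoted in the same row use a different configuration (batch size 1024, learning rate 0.001, weight decay 0.05, RandAugment, mixup, cutmix) and would need their own schedule.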