Training Transformers with 4-bit Integers

Authors: Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our algorithm for training transformers on a wide variety of tasks, including natural language understanding, question answering, machine translation, and image classification. Our algorithm achieves competitive or superior accuracy compared with existing works on 4-bit training [47, 8].
Researcher Affiliation | Collaboration | 1 Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University; 2 Institute for Interdisciplinary Information Sciences, Tsinghua University
Pseudocode | Yes | Procedure HQ-MM: 1. Compute XH and HᵀW in FP16. 2. Quantize the resulting matrices to INT4 by LSQ. 3. Multiply the two INT4 matrices. 4. Dequantize the resulting INT32 matrix to FP16 by multiplying by s_X s_W. (A runnable sketch of these steps follows the table.)
Open Source Code | Yes | Our code is available at https://github.com/xijiu9/Train_Transformers_with_INT4.
Open Datasets | Yes | We use the pretrained BERT-base-uncased and BERT-large-uncased [24] models, and evaluate the performance of our method on the GLUE dev set [53], SQuAD [41], SQuAD v2 [40], Adversarial QA [4], CoNLL-2003 [42], and SWAG [61] datasets. We train a Transformer-base [52] model on the WMT 14 En-De dataset [6] for machine translation. We load ViT checkpoints pretrained on ImageNet21k [13] and fine-tune them on CIFAR-10, CIFAR-100 [28], and ImageNet1k. (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | We evaluate the performance of our method on the GLUE dev set [53].
Hardware Specification | Yes | On an NVIDIA RTX 3090 GPU, which has a peak throughput of 142 FP16 TFLOPS and 568 INT4 TOPS. We employed the NVIDIA GeForce RTX 3090 for most of the experiments, while the NVIDIA A40 was used to evaluate BERT-Large and ViT-L. Furthermore, we conducted runtime measurements on NVIDIA T4, RTX 3090, and A100 GPUs. (The implied peak speedup is worked out after the table.)
Software Dependencies | No | The paper mentions using "CUDA and CUTLASS" and "FP16 PyTorch AMP" but does not specify version numbers for these software components. It also refers to several GitHub repositories for model implementations (e.g., HuggingFace Transformers, Fairseq, ViT-pytorch, DeiT) without pinning versions for the libraries themselves. (An AMP baseline sketch follows the table.)
Experiment Setup | No | The paper states: "We adopt default architectures, optimizers, schedulers, and hyper-parameters for all the evaluated models." While some details are mentioned, such as varying hidden-layer size, intermediate fully connected layer size, and batch size, along with specific epoch counts for re-initialization, a comprehensive list of hyperparameters (e.g., learning rates, the specific optimizers used, or detailed training schedules) is not provided in the main text.
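Below is a minimal NumPy sketch of the four HQ-MM steps quoted in the Pseudocode row. It is an illustration under simplifying assumptions, not the authors' CUDA/CUTLASS implementation: per-tensor max-based scales stand in for LSQ's learned step sizes, the Hadamard block size is taken to equal the feature dimension, and default NumPy floats stand in for FP16 storage.

```python
import numpy as np

def hadamard(k: int) -> np.ndarray:
    """Orthonormal 2^k x 2^k Hadamard matrix (Sylvester construction)."""
    H = np.array([[1.0]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])

def quantize_int4(A: np.ndarray, step: float):
    """Symmetric 4-bit quantization; `step` stands in for LSQ's learned step size."""
    q = np.clip(np.round(A / step), -8, 7).astype(np.int32)
    return q, step

def hq_mm(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Sketch of the HQ-MM steps quoted above (per-tensor scales, single Hadamard block)."""
    k = int(np.log2(X.shape[1]))
    H = hadamard(k)                                     # block size == feature dim for simplicity
    XH, HW = X @ H, H.T @ W                             # step 1: Hadamard-transform both operands
    qX, sX = quantize_int4(XH, np.abs(XH).max() / 7)    # step 2: quantize to INT4
    qW, sW = quantize_int4(HW, np.abs(HW).max() / 7)
    acc = qX @ qW                                       # step 3: INT4 x INT4 -> INT32 accumulate
    return acc.astype(np.float32) * sX * sW             # step 4: dequantize by s_X * s_W

# Tiny check that the quantized product tracks the exact one.
rng = np.random.default_rng(0)
X, W = rng.standard_normal((4, 8)), rng.standard_normal((8, 8))
print(np.abs(hq_mm(X, W) - X @ W).max())
```

Because the normalized Hadamard matrix is orthogonal, XH times HᵀW equals XW exactly in full precision; the transform only spreads outliers so that the INT4 quantization error stays small.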
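The datasets listed in the Open Datasets row are publicly available. As a hedged illustration only (the paper does not prescribe this loading path, and the dataset IDs below are the standard HuggingFace Hub names rather than identifiers from the paper), the GLUE and SQuAD benchmarks can be pulled with the `datasets` library:

```python
# Assumption: loading via HuggingFace `datasets`; IDs are standard Hub names, not from the paper.
from datasets import load_dataset

glue_sst2 = load_dataset("glue", "sst2")   # one GLUE task; reported metrics use the dev ("validation") split
squad_v1 = load_dataset("squad")           # SQuAD
squad_v2 = load_dataset("squad_v2")        # SQuAD v2

print(glue_sst2)                           # DatasetDict with train / validation / test splits
```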
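The Hardware Specification row quotes peak throughputs of 142 FP16 TFLOPS and 568 INT4 TOPS for the RTX 3090; the one-line ratio below shows the roughly 4x peak advantage that motivates INT4 training (achieved end-to-end speedups are necessarily lower than this hardware peak).

```python
# Peak numbers quoted in the Hardware Specification row (RTX 3090).
fp16_tflops = 142   # FP16 tensor-core throughput
int4_tops = 568     # INT4 tensor-core throughput
print(int4_tops / fp16_tflops)   # 4.0 -> ~4x peak advantage for INT4 matrix math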
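The FP16 PyTorch AMP baseline mentioned in the Software Dependencies row corresponds to standard mixed-precision training. A minimal sketch with `torch.cuda.amp` is shown below; the model, optimizer, and loss are placeholders for illustration, not the authors' training setup.

```python
import torch

# Placeholder model and optimizer; the paper's actual training scripts live in the linked repo.
model = torch.nn.Linear(768, 768).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

def amp_step(x, target):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                    # FP16 forward pass (the AMP baseline)
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()                      # scaled backward to avoid FP16 underflow
    scaler.step(optimizer)                             # unscale gradients, then FP32 optimizer update
    scaler.update()
    return loss.item()
```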