Towards Fully 8-bit Integer Inference for the Transformer Model

Authors: Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, Jingbo Zhu

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on the WMT16 En-Ro, WMT14 En-De and En-Fr translation tasks as well as the WikiText-103 language modelling task show that the fully 8-bit Transformer system achieves comparable performance with the floating-point baseline but requires nearly 4× less memory footprint.
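
The "nearly 4×" figure follows directly from the storage width: FP32 weights occupy 32 bits per parameter while INT8 weights occupy 8 bits. A back-of-the-envelope check in Python (the parameter count below is an illustrative assumption, not a number from the paper):

    # Rough memory comparison: FP32 vs INT8 weight storage.
    params = 65_000_000            # assumed Transformer-base-sized model, for illustration only
    fp32_bytes = params * 4        # 32-bit floats: 4 bytes per parameter
    int8_bytes = params * 1        # 8-bit integers: 1 byte per parameter
    print(fp32_bytes / int8_bytes) # 4.0 -> roughly a 4x smaller weight footprint
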
Researcher Affiliation | Collaboration | 1 Natural Language Processing Lab., Northeastern University, Shenyang, China; 2 NiuTrans Research, Shenyang, China; 3 CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China
Pseudocode | Yes | Algorithm 1 SCALE PROPAGATION PROTOCOL. Input: Operation OP; INT8 Tensors x1...n; Scales s1...n. Output: INT8 Tensor x; Scale s
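
The quoted interface (several INT8 tensors with per-tensor scales in, one INT8 tensor with one scale out) can be sketched as below. This is only a generic requantization sketch under assumed semantics: the function name is hypothetical and the body does not reproduce the paper's actual propagation rules, which derive the output scale from the input scales per operation type rather than measuring it from a dequantized result.

    import numpy as np

    def scale_propagate(op, tensors, scales):
        """Hypothetical sketch of one scale-propagation step.

        op      -- a floating-point reference operation, e.g. np.add or np.matmul
        tensors -- INT8 arrays x1..xn
        scales  -- per-tensor scales s1..sn
        Returns an INT8 array and its scale, matching the Algorithm 1 signature.
        """
        # Dequantize to a floating-point reference (conceptual only; the point
        # of the paper is to avoid this round trip at inference time).
        reals = [x.astype(np.float32) * s for x, s in zip(tensors, scales)]
        y = op(*reals)

        # Pick an output scale from the result's range and requantize to INT8.
        peak = float(np.abs(y).max())
        s_out = peak / 127.0 if peak > 0 else 1.0
        x_out = np.clip(np.round(y / s_out), -127, 127).astype(np.int8)
        return x_out, s_out
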
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it include a specific repository link or explicit code release statement.
Open Datasets | Yes | We evaluate our methods on three machine translation (MT) tasks and a language modelling (LM) task, including the WMT16 English-Romanian (En-Ro), the WMT14 English-German (En-De), the WMT14 English-French (En-Fr) and the WikiText-103 LM tasks.
Dataset Splits | Yes | For En-Ro (610K pairs), we use newsdev-2016 and newstest-2016 as the validation and test sets respectively. For En-De (4.5M pairs), newstest-2013 is the validation set and newstest-2014 is the test set. For En-Fr (36M pairs), we validate the system on the combination of newstest-2012 and newstest-2013, and test it on newstest-2014. The WikiText-103 dataset contains a training set of 103 million words. Both the validation and test sets contain 0.2 million words.
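
For quick reference, the splits quoted above can be collected into a small mapping (a convenience summary only; the split names and sizes are exactly those in the quote):

    # Validation/test splits as quoted above.
    SPLITS = {
        "WMT16 En-Ro":  {"train": "610K pairs", "valid": "newsdev-2016", "test": "newstest-2016"},
        "WMT14 En-De":  {"train": "4.5M pairs", "valid": "newstest-2013", "test": "newstest-2014"},
        "WMT14 En-Fr":  {"train": "36M pairs",  "valid": "newstest-2012 + newstest-2013", "test": "newstest-2014"},
        "WikiText-103": {"train": "103M words", "valid": "0.2M words", "test": "0.2M words"},
    }
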
Hardware Specification | Yes | All experiments are run on 8 NVIDIA TITAN V GPUs.
Software Dependencies | No | The paper mentions software tools such as the "Adam optimizer", "ReLU activation", "Moses", and "byte-pair encoding" but does not specify version numbers for these or for any other software libraries/frameworks.
Experiment Setup | Yes | For training, we use the Adam optimizer with β1 = 0.9 and β2 = 0.997. We adopt the inverse square root learning rate schedule with 8K warmup steps and a learning rate of 0.001/0.0007 for Transformer-base/big. The embedding size is set to 512 for Transformer-base and 1,024 for Transformer-big. The number of heads is 8/16 for Transformer-base/big. The hidden size is 4× the embedding size in both settings. For the lm-big training, we use Nesterov's accelerated gradient. We adopt the cosine learning rate schedule with 16K warmup steps and a maximum learning rate of 1.
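
The inverse square root schedule with warmup mentioned in the quote is the standard Transformer recipe; a minimal sketch is given below. The peak learning rate and warmup length come from the quote, while the functional form itself is the common recipe and is an assumption here, not a formula taken from the paper.

    def inverse_sqrt_lr(step, peak_lr=0.001, warmup_steps=8000):
        """Inverse square root schedule with linear warmup (standard recipe).

        peak_lr=0.001 matches the Transformer-base setting quoted above
        (use 0.0007 for Transformer-big); the shape of the schedule is an
        assumption based on the usual Transformer setup.
        """
        step = max(step, 1)
        if step < warmup_steps:
            return peak_lr * step / warmup_steps       # linear warmup
        return peak_lr * (warmup_steps / step) ** 0.5  # inverse sqrt decay

    print(inverse_sqrt_lr(8000))   # 0.001 at the end of warmup
    print(inverse_sqrt_lr(32000))  # 0.0005 after decay
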