Towards Fully 8-bit Integer Inference for the Transformer Model
Authors: Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, Jingbo Zhu
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on WMT16 En-Ro, WMT14 En-De and En-Fr translation tasks as well as the WikiText-103 language modelling task show that the fully 8-bit Transformer system achieves comparable performance with the floating point baseline but requires nearly 4× less memory footprint. |
| Researcher Affiliation | Collaboration | 1) Natural Language Processing Lab., Northeastern University, Shenyang, China; 2) NiuTrans Research, Shenyang, China; 3) CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China |
| Pseudocode | Yes | Algorithm 1 (Scale Propagation Protocol). Input: operation OP; INT8 tensors x1, ..., xn; scales s1, ..., sn. Output: INT8 tensor x; scale s. (An illustrative sketch of scale propagation is given below the table.) |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it include a specific repository link or explicit code release statement. |
| Open Datasets | Yes | We evaluate our methods on three machine translation (MT) tasks and a language modelling (LM) task, including the WMT16 English-Romanian (En-Ro), the WMT14 English-German (En-De), the WMT14 English-French (En-Fr) and the WikiText-103 LM tasks. |
| Dataset Splits | Yes | For En-Ro (610K pairs), we use newsdev-2016 and newstest-2016 as the validation and test sets respectively. For En-De (4.5M pairs), newstest-2013 is the validation set and newstest-2014 is the test set. For En-Fr (36M pairs), we validate the system on the combination of newstest-2012 and newstest-2013, and test it on newstest-2014. The WikiText-103 dataset contains a training set of 103 million words. Both the validation and test sets contain 0.2 million words. |
| Hardware Specification | Yes | All experiments are run on 8 NVIDIA TITAN V GPUs. |
| Software Dependencies | No | The paper mentions software tools like "Adam optimizer", "ReLU activation", "Moses", and "byte-pair encoding" but does not specify any version numbers for these or other software libraries/frameworks. |
| Experiment Setup | Yes | For training, we use the Adam optimizer with β1 = 0.9 and β2 = 0.997. We adopt the inverse square root learning rate schedule with 8K warmup steps and a learning rate of 0.001/0.0007 for Transformer-base/big. The embedding size is set to 512 for Transformer-base and 1,024 for Transformer-big. The number of heads is 8/16 for Transformer-base/big. The hidden size equals 4× the embedding size in both settings. For the lm-big training, we use Nesterov's accelerated gradient. We adopt the cosine learning rate schedule with 16K warmup steps and a maximum learning rate of 1. (A sketch of the inverse square root schedule is given below the table.) |
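The Pseudocode row above quotes only the header of the paper's Algorithm 1 (Scale Propagation Protocol). As a rough illustration of the general idea of propagating quantization scales through integer operations rather than dequantizing between layers, the NumPy sketch below shows an INT8 matrix multiply whose output scale is the product of the input scales. This is not the authors' exact protocol; the function names and the per-tensor max-abs calibration are assumptions made for the example.

```python
import numpy as np

def quantize(x, s):
    """Map a float tensor to INT8 with scale s: x ≈ s * q, q in [-127, 127]."""
    return np.clip(np.round(x / s), -127, 127).astype(np.int8)

def int8_matmul_with_scale(q_a, s_a, q_b, s_b):
    """INT8 matmul that propagates scales instead of dequantizing.

    Since a ≈ s_a * q_a and b ≈ s_b * q_b, the INT32 accumulator q_a @ q_b
    approximates (a @ b) / (s_a * s_b), so the combined scale is s_a * s_b.
    The result is then requantized to INT8 with a fresh per-tensor scale.
    """
    acc = q_a.astype(np.int32) @ q_b.astype(np.int32)  # integer accumulation
    combined = acc * (s_a * s_b)                        # float view of the product
    s_out = np.abs(combined).max() / 127.0              # assumed max-abs requantization
    return quantize(combined, s_out), s_out

# Toy usage: the dequantized INT8 result stays close to the float matmul.
a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 4).astype(np.float32)
s_a, s_b = np.abs(a).max() / 127.0, np.abs(b).max() / 127.0
q_out, s_out = int8_matmul_with_scale(quantize(a, s_a), s_a, quantize(b, s_b), s_b)
print(np.abs(a @ b - q_out.astype(np.float32) * s_out).max())  # small quantization error
```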
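The Experiment Setup row mentions the inverse square root learning rate schedule with 8K warmup steps and a peak rate of 0.001/0.0007 for Transformer-base/big. A minimal sketch of this kind of schedule (a common linear-warmup formulation, not necessarily the authors' exact implementation; the function name and defaults are assumptions) is:

```python
import math

def inverse_sqrt_lr(step, max_lr=0.001, warmup_steps=8000):
    """Inverse square root schedule with linear warmup.

    The rate rises linearly to max_lr over warmup_steps updates, then decays
    in proportion to 1/sqrt(step). Defaults follow the Transformer-base
    setting quoted above (max_lr = 0.001, 8K warmup steps).
    """
    step = max(step, 1)
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * math.sqrt(warmup_steps / step)

# Example: peak rate at the end of warmup, decayed rate after 100K updates.
print(inverse_sqrt_lr(8_000))    # 0.001
print(inverse_sqrt_lr(100_000))  # ≈ 0.00028
```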