An Efficient Transformer Decoder with Compressed Sub-layers
Authors: Yanyang Li, Ye Lin, Tong Xiao, Jingbo Zhu
AAAI 2021, pp. 13315-13323 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 14 WMT machine translation tasks show that our model is 1.42× faster with performance on par with a strong baseline. |
| Researcher Affiliation | Collaboration | Yanyang Li1*, Ye Lin1, Tong Xiao1,2, Jingbo Zhu1,2; 1NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China; 2NiuTrans Research, Shenyang, China; {blamedrlee, linye2015}@outlook.com, {xiaotong,zhujingbo}@mail.neu.edu.cn |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper states: 'Our baseline system is based on the open-source implementation of the Transformer model presented in Ott et al. (2019)', but it does not provide a link to the authors' own code for CAN or state that their code will be released. |
| Open Datasets | Yes | Datasets: We evaluate our methods on 14 machine translation tasks (7 datasets × 2 translation directions each), including WMT14 En↔{De, Fr} and WMT17 En↔{De, Fi, Lv, Ru, Cs}. WMT14 En↔{De, Fr} datasets are tokenized by a script from Moses. We apply BPE (Sennrich, Haddow, and Birch 2016) with 32K merge operations to segment words into subword units. Sentences with more than 250 subword units are removed. The first two rows of Table 1 give the detailed statistics of these two datasets. For En-De, we share the source and target vocabularies. We choose newstest-2013 as the validation set and newstest-2014 as the test set. For En-Fr, we validate the system on the combination of newstest-2012 and newstest-2013, and test it on newstest-2014. All WMT17 datasets are the official preprocessed version from the WMT17 website. (A preprocessing sketch based on this description follows the table.) |
| Dataset Splits | Yes | For En-De, we choose newstest-2013 as the validation set and newstest-2014 as the test set. For En-Fr, we validate the system on the combination of newstest-2012 and newstest-2013, and test it on newstest-2014. All WMT17 datasets are the official preprocessed version from the WMT17 website. BPE with 32K merge operations is similarly applied to these datasets. We use the concatenation of all available preprocessed validation sets in the WMT17 datasets as our validation set: En-De: the concatenation of newstest2014, newstest2015 and newstest2016. En-Fi: the concatenation of newstest2015, newsdev2015, newstest2016 and newstestB2016. En-Lv: newsdev2016. En-Ru: the concatenation of newstest2014, newstest2015 and newstest2016. En-Cs: the concatenation of newstest2014, newstest2015 and newstest2016. |
| Hardware Specification | Yes | All systems are trained on 8 NVIDIA TITAN V GPUs with mixed-precision training (Micikevicius et al. 2018) and a batch size of 4,096 tokens per GPU. |
| Software Dependencies | No | The paper mentions 'Our baseline system is based on the open-source implementation of the Transformer model presented in Ott et al. (2019)', but it does not provide version numbers for software dependencies such as PyTorch, Python, or other libraries. |
| Experiment Setup | Yes | The embedding size is set to 512. The number of attention heads is 8. The FFN hidden size equals 4× the embedding size. Dropout with the value of 0.1 is used for regularization. We adopt the inverse square root learning rate schedule with 8,000 warmup steps and a 0.0007 learning rate. We train until the model stops improving on the validation set. All systems are trained on 8 NVIDIA TITAN V GPUs with mixed-precision training (Micikevicius et al. 2018) and a batch size of 4,096 tokens per GPU. We average model parameters over the last 5 epochs for better performance. At test time, the model is decoded with a beam of width 4 and half-precision. For an accurate speed comparison, we decode with a batch size of 1 to avoid paddings. (A learning-rate schedule sketch based on this setup follows the table.) |
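
The Open Datasets row describes BPE segmentation with 32K merge operations followed by removal of sentences longer than 250 subword units. Below is a minimal sketch of that length filter, assuming BPE-segmented parallel files with whitespace-separated subword units; the file names and the `filter_parallel` helper are hypothetical and not taken from the paper or its codebase.

```python
# Minimal sketch of the length filtering described in the paper: after BPE
# segmentation (32K merge operations), sentence pairs in which either side
# exceeds 250 subword units are removed. File names are placeholders.

MAX_SUBWORDS = 250  # limit stated in the paper


def filter_parallel(src_in, tgt_in, src_out, tgt_out, max_len=MAX_SUBWORDS):
    """Keep only pairs whose BPE-segmented sides both fit within max_len units."""
    with open(src_in, encoding="utf-8") as fs, \
         open(tgt_in, encoding="utf-8") as ft, \
         open(src_out, "w", encoding="utf-8") as out_s, \
         open(tgt_out, "w", encoding="utf-8") as out_t:
        for s, t in zip(fs, ft):
            # BPE output is whitespace-separated subword units, so splitting
            # on whitespace gives the unit count directly.
            if len(s.split()) <= max_len and len(t.split()) <= max_len:
                out_s.write(s)
                out_t.write(t)


if __name__ == "__main__":
    # Hypothetical file names; the actual corpus paths are not given in the paper.
    filter_parallel("train.bpe.en", "train.bpe.de",
                    "train.filtered.en", "train.filtered.de")
```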
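
The Experiment Setup row specifies an inverse square root learning rate schedule with 8,000 warmup steps and a peak learning rate of 0.0007. The sketch below shows a common fairseq-style formulation of that schedule; the linear warmup interpolation and the `inverse_sqrt_lr` helper are assumptions, since the paper does not spell out these details.

```python
import math

# Sketch of an inverse-square-root schedule with 8,000 warmup steps and a
# peak learning rate of 0.0007, matching the values in the Experiment Setup
# row. The linear warmup is an assumption, not stated in the paper.

PEAK_LR = 7e-4
WARMUP_STEPS = 8000


def inverse_sqrt_lr(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * math.sqrt(warmup / step)


if __name__ == "__main__":
    for s in (1000, 8000, 32000, 128000):
        print(s, f"{inverse_sqrt_lr(s):.6f}")
```

For reference, fairseq's `inverse_sqrt` scheduler expresses the post-warmup decay as `decay_factor / sqrt(step)` with `decay_factor = lr * sqrt(warmup_updates)`, which is algebraically the same as the form above.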