An Efficient Transformer Decoder with Compressed Sub-layers

Authors: Yanyang Li, Ye Lin, Tong Xiao, Jingbo Zhu (pp. 13315-13323)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on 14 WMT machine translation tasks show that our model is 1.42× faster with performance on par with a strong baseline.
Researcher Affiliation | Collaboration | Yanyang Li 1*, Ye Lin 1, Tong Xiao 1,2, Jingbo Zhu 1,2; 1 NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China; 2 NiuTrans Research, Shenyang, China; {blamedrlee, linye2015}@outlook.com, {xiaotong,zhujingbo}@mail.neu.edu.cn
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper states: 'Our baseline system is based on the open-source implementation of the Transformer model presented in Ott et al. (2019).' However, it does not provide a link to the authors' own code for CAN or state that their code will be released.
Open Datasets | Yes | Datasets: We evaluate our methods on 14 machine translation tasks (7 datasets × 2 translation directions each), including WMT14 En↔{De, Fr} and WMT17 En↔{De, Fi, Lv, Ru, Cs}. The WMT14 En↔{De, Fr} datasets are tokenized with a script from Moses. We apply BPE (Sennrich, Haddow, and Birch 2016) with 32K merge operations to segment words into subword units. Sentences with more than 250 subword units are removed. The first two rows of Table 1 give the detailed statistics of these two datasets. For En↔De, we share the source and target vocabularies. We choose newstest-2013 as the validation set and newstest-2014 as the test set. For En↔Fr, we validate the system on the combination of newstest-2012 and newstest-2013, and test it on newstest-2014. All WMT17 datasets are the official preprocessed versions from the WMT17 website. (See the preprocessing sketch after the table.)
Dataset Splits | Yes | For En↔De, we choose newstest-2013 as the validation set and newstest-2014 as the test set. For En↔Fr, we validate the system on the combination of newstest-2012 and newstest-2013, and test it on newstest-2014. All WMT17 datasets are the official preprocessed versions from the WMT17 website. BPE with 32K merge operations is similarly applied to these datasets. We use the concatenation of all available preprocessed validation sets in the WMT17 datasets as our validation set: En↔De: newstest2014, newstest2015 and newstest2016; En↔Fi: newstest2015, newsdev2015, newstest2016 and newstestB2016; En↔Lv: newsdev2016; En↔Ru: newstest2014, newstest2015 and newstest2016; En↔Cs: newstest2014, newstest2015 and newstest2016. (The preprocessing sketch after the table also assembles these validation sets.)
Hardware Specification | Yes | All systems are trained on 8 NVIDIA TITAN V GPUs with mixed-precision training (Micikevicius et al. 2018) and a batch size of 4,096 tokens per GPU.
Software Dependencies | No | The paper mentions 'Our baseline system is based on the open-source implementation of the Transformer model presented in Ott et al. (2019)', but it does not provide version numbers for software dependencies such as PyTorch, Python, or other libraries.
Experiment Setup | Yes | The embedding size is set to 512. The number of attention heads is 8. The FFN hidden size equals 4× the embedding size. Dropout with a value of 0.1 is used for regularization. We adopt the inverse square root learning rate schedule with 8,000 warmup steps and a learning rate of 0.0007. We stop training once the model stops improving on the validation set. All systems are trained on 8 NVIDIA TITAN V GPUs with mixed-precision training (Micikevicius et al. 2018) and a batch size of 4,096 tokens per GPU. We average the model parameters of the last 5 epochs for better performance. At test time, the model is decoded with a beam of width 4 and half precision. For an accurate speed comparison, we decode with a batch size of 1 to avoid padding. (See the training-schedule sketch after the table.)
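
Preprocessing sketch. The BPE segmentation, length filtering, and validation-set concatenation reported in the Open Datasets and Dataset Splits rows can be restated as a short Python script. The sketch below is not the authors' code: it assumes the subword-nmt package and hypothetical file names (train.tok.*, and a newstest*.{pair}.{side} layout for the WMT17 validation files); only the 32K merge operations, the 250-subword limit, and the listed validation sets come from the paper.

    from subword_nmt.learn_bpe import learn_bpe
    from subword_nmt.apply_bpe import BPE

    MERGE_OPS = 32000      # 32K BPE merge operations (from the paper)
    MAX_SUBWORDS = 250     # sentences longer than this are removed (from the paper)

    # 1. Learn BPE codes on the tokenized training text. For En-De the paper shares
    #    source/target vocabularies, so a concatenated corpus file is assumed here.
    with open("train.tok.joint") as corpus, open("bpe.32k.codes", "w") as codes:
        learn_bpe(corpus, codes, MERGE_OPS)

    # 2. Apply BPE and drop sentence pairs with more than 250 subword units.
    with open("bpe.32k.codes") as codes:
        bpe = BPE(codes)

    with open("train.tok.en") as src, open("train.tok.de") as tgt, \
         open("train.bpe.en", "w") as out_src, open("train.bpe.de", "w") as out_tgt:
        for s, t in zip(src, tgt):
            s_bpe, t_bpe = bpe.segment(s.strip()), bpe.segment(t.strip())
            if max(len(s_bpe.split()), len(t_bpe.split())) <= MAX_SUBWORDS:
                out_src.write(s_bpe + "\n")
                out_tgt.write(t_bpe + "\n")

    # 3. Build each WMT17 validation set by concatenating the files listed above.
    VALID_SETS = {
        "en-de": ["newstest2014", "newstest2015", "newstest2016"],
        "en-fi": ["newstest2015", "newsdev2015", "newstest2016", "newstestB2016"],
        "en-lv": ["newsdev2016"],
        "en-ru": ["newstest2014", "newstest2015", "newstest2016"],
        "en-cs": ["newstest2014", "newstest2015", "newstest2016"],
    }

    def build_valid(pair, side, out_path):
        """Write the concatenated validation set for one language pair and side."""
        with open(out_path, "w") as out:
            for name in VALID_SETS[pair]:
                with open(f"{name}.{pair}.{side}") as f:  # hypothetical file layout
                    out.write(f.read())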
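
Training-schedule sketch. Two parts of the Experiment Setup row translate directly into code: the inverse square root learning-rate schedule (8,000 warmup steps, peak rate 0.0007) and the averaging of the last 5 epoch checkpoints. The sketch below is a minimal illustration, not the authors' implementation; the warmup initial rate, the checkpoint file names, and the 'model' key in the checkpoint dictionary are assumptions (the latter follows the fairseq convention the baseline builds on).

    import math
    import torch

    PEAK_LR = 7e-4        # 0.0007, from the setup above
    WARMUP_STEPS = 8000   # warmup steps, from the setup above

    def inverse_sqrt_lr(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, init_lr=1e-7):
        """Inverse square root schedule: linear warmup, then 1/sqrt(step) decay."""
        if step < warmup:
            # linear warmup from init_lr (an assumed value) up to peak_lr
            return init_lr + (peak_lr - init_lr) * step / warmup
        # decay so that the rate equals peak_lr exactly at step == warmup
        return peak_lr * math.sqrt(warmup / step)

    def average_checkpoints(paths):
        """Average the parameter tensors of the given checkpoints (here: the last 5 epochs)."""
        avg = None
        for path in paths:
            state = torch.load(path, map_location="cpu")["model"]  # assumed layout
            if avg is None:
                avg = {k: v.clone().float() for k, v in state.items()}
            else:
                for k, v in state.items():
                    avg[k] += v.float()
        return {k: v / len(paths) for k, v in avg.items()}

    # e.g. averaged = average_checkpoints([f"checkpoint{e}.pt" for e in range(46, 51)])

In the fairseq toolkit that the baseline is built on, the same settings roughly correspond to the inverse_sqrt scheduler with --warmup-updates 8000 and --lr 0.0007, and to the scripts/average_checkpoints.py utility for the averaging step.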