Incorporating BERT into Parallel Sequence Decoding with Adapters

Authors: Junliang Guo, Zhirui Zhang, Linli Xu, Hao-Ran Wei, Boxing Chen, Enhong Chen

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on neural machine translation tasks where the proposed method consistently outperforms autoregressive baselines while reducing the inference latency by half, and achieves 36.49/33.57 BLEU scores on IWSLT14 German-English/WMT14 German-English translation.
Researcher Affiliation | Collaboration | Junliang Guo (1), Zhirui Zhang (2), Linli Xu (1,3), Hao-Ran Wei (2), Boxing Chen (2), Enhong Chen (1); (1) Anhui Province Key Laboratory of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China; (2) Alibaba DAMO Academy; (3) IFLYTEK Co., Ltd.
Pseudocode | No | The paper mentions 'Details of the decoding algorithm are provided in Appendix B.' but Appendix B or any explicit pseudocode/algorithm block is not included in the provided text.
Open Source Code | Yes | Our implementation is based on fairseq and is available at https://github.com/lemmonation/abnet.
Open Datasets | Yes | We evaluate our framework on benchmark datasets including IWSLT14 German-English (IWSLT14 De-En), WMT14 English-German translation (WMT14 En-De/De-En), and WMT16 Romanian-English (WMT16 Ro-En). We show the generality of our method on several low-resource datasets including IWSLT14 English-Italian/Spanish/Dutch (IWSLT14 En-It/Es/Nl).
Dataset Splits | Yes | For IWSLT14 tasks, we adopt the official split of train/valid/test sets. For WMT14 tasks, we utilize newstest2013 and newstest2014 as the validation and test sets respectively. For WMT16 tasks, we use newsdev2016 and newstest2016 as the validation and test sets.
Hardware Specification | Yes | We train our framework on 1/8 Nvidia 1080Ti GPUs for IWSLT14/WMT tasks, and it takes 1/7 days to finish training. (That is, 1 GPU for about 1 day on IWSLT14 and 8 GPUs for about 7 days on WMT tasks.)
Software Dependencies | No | The paper states 'Our implementation is based on fairseq' but does not provide specific version numbers for fairseq or other ancillary software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We mainly build our framework on bert-base models (n_layers = 12, n_heads = 12, d_hidden = 768, d_FFN = 3072). Specifically, for English we use bert-base-uncased on IWSLT14 and bert-base-cased on WMT tasks. We use bert-base-german-cased for German and bert-base-multilingual-cased for all other languages. When extending to autoregressive decoding, we utilize bert-large-cased (n_layers = 24, n_heads = 16, d_hidden = 1024, d_FFN = 4096) for English to keep consistency with [41]. For adapters, on the encoder side, we set the hidden dimension between the two FFN layers as d_A^enc = 2048 for WMT tasks and 512 for IWSLT14 tasks. On the decoder side, the hidden dimension of the cross-attention module is set equal to the hidden dimension of the BERT model, i.e., d_A^dec = 768 for bert-base models and d_A^dec = 1024 for bert-large models. ... During inference, we generate multiple translation candidates by taking the top B length predictions into consideration, and select the translation with the highest probability as the final result. We set B = 4 for all tasks, and the upper bound on the number of iterative decoding steps is set to 10. For autoregressive decoding, we use beam search with width 5 for all tasks.
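
To make the reported adapter dimensions concrete, below is a minimal PyTorch sketch of the two adapter types described in the setup: an encoder-side adapter built from two FFN layers with intermediate dimension d_A^enc, and a decoder-side adapter whose cross-attention module matches the BERT hidden size d_A^dec. The class names, activation choice, layer-norm placement, and residual connections are illustrative assumptions and do not mirror the released fairseq/abnet code.

```python
import torch
import torch.nn as nn


class EncoderAdapter(nn.Module):
    """Encoder-side adapter: two FFN layers with intermediate dimension
    d_A^enc (2048 for WMT, 512 for IWSLT14), inserted into a frozen BERT layer.
    Activation, layer norm, and residual placement are assumptions."""

    def __init__(self, d_model: int = 768, d_adapter: int = 2048):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_adapter)  # first FFN layer
        self.fc2 = nn.Linear(d_adapter, d_model)  # second FFN layer
        self.activation = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen BERT representation intact.
        return x + self.fc2(self.activation(self.fc1(self.layer_norm(x))))


class DecoderAdapter(nn.Module):
    """Decoder-side adapter built around a cross-attention module whose hidden
    dimension equals the BERT hidden size (d_A^dec = 768 for bert-base,
    1024 for bert-large). Head count and normalization are assumptions."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, encoder_out: torch.Tensor) -> torch.Tensor:
        # Attend from the target-side BERT states to the source-side encoder
        # outputs, then add back with a residual connection.
        attn_out, _ = self.cross_attn(query=x, key=encoder_out, value=encoder_out)
        return self.layer_norm(x + attn_out)
```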
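
The inference procedure in the setup (top-B length predictions with B = 4, at most 10 refinement iterations, and selection of the candidate with the highest probability) can be sketched as a mask-predict-style loop. The interfaces `model.encoder`, `model.predict_lengths`, and `model.decoder` are hypothetical placeholders, and the linear re-masking schedule is a common choice rather than the one confirmed by the paper.

```python
import torch


@torch.no_grad()
def parallel_decode(model, src_tokens, length_beam: int = 4, max_iter: int = 10,
                    mask_id: int = 103):
    """Sketch of parallel decoding with B = 4 length candidates and at most
    10 refinement iterations; the candidate with the highest average
    log-probability is returned. Model interfaces are hypothetical."""
    encoder_out = model.encoder(src_tokens)
    # Top-B target length predictions (assumed (max_len,) logits for one sentence).
    top_lengths = model.predict_lengths(encoder_out).topk(length_beam).indices

    candidates, scores = [], []
    for length in top_lengths.tolist():
        # Start from an all-[MASK] target of the predicted length.
        tokens = torch.full((1, length), mask_id, dtype=torch.long)
        log_probs = torch.zeros(1, length)
        for t in range(max_iter):
            logits = model.decoder(tokens, encoder_out)           # (1, length, vocab)
            log_probs, tokens = logits.log_softmax(dim=-1).max(dim=-1)
            # Re-mask the least confident positions for the next iteration
            # (linear schedule, a common choice in mask-predict decoding).
            n_mask = int(length * (1 - (t + 1) / max_iter))
            if n_mask == 0:
                break
            remask = log_probs[0].topk(n_mask, largest=False).indices
            tokens[0, remask] = mask_id
        candidates.append(tokens)
        scores.append(log_probs.mean().item())

    best = max(range(length_beam), key=lambda i: scores[i])
    return candidates[best]
```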