Incorporating BERT into Parallel Sequence Decoding with Adapters
Authors: Junliang Guo, Zhirui Zhang, Linli Xu, Hao-Ran Wei, Boxing Chen, Enhong Chen
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on neural machine translation tasks where the proposed method consistently outperforms autoregressive baselines while reducing the inference latency by half, and achieves 36.49/33.57 BLEU scores on IWSLT14 German-English/WMT14 German-English translation. |
| Researcher Affiliation | Collaboration | Junliang Guo (1), Zhirui Zhang (2), Linli Xu (1,3), Hao-Ran Wei (2), Boxing Chen (2), Enhong Chen (1). (1) Anhui Province Key Laboratory of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China; (2) Alibaba DAMO Academy; (3) iFLYTEK Co., Ltd. |
| Pseudocode | No | The paper mentions 'Details of the decoding algorithm are provided in Appendix B.', but neither Appendix B nor any explicit pseudocode/algorithm block is included in the provided text. |
| Open Source Code | Yes | Our implementation is based on fairseq and is available at https://github.com/lemmonation/abnet. |
| Open Datasets | Yes | We evaluate our framework on benchmark datasets including IWSLT14 German-English (IWSLT14 De-En), WMT14 English-German translation (WMT14 En-De/De-En), and WMT16 Romanian-English (WMT16 Ro-En). We show the generality of our method on several low-resource datasets including IWSLT14 English-Italian/Spanish/Dutch (IWSLT14 En-It/Es/Nl). |
| Dataset Splits | Yes | For IWSLT14 tasks, we adopt the official split of train/valid/test sets. For WMT14 tasks, we utilize newstest2013 and newstest2014 as the validation and test set respectively. For WMT16 tasks, we use newsdev2016 and newstest2016 as the validation and test set. |
| Hardware Specification | Yes | We train our framework on 1/8 Nvidia 1080Ti GPUs for IWSLT14/WMT tasks, and it takes 1/7 days to finish training. |
| Software Dependencies | No | The paper states 'Our implementation is based on fairseq' but does not provide specific version numbers for fairseq or other ancillary software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We mainly build our framework on bert-base models (n_layers = 12, n_heads = 12, d_hidden = 768, d_FFN = 3072). Specifically, for English we use bert-base-uncased on IWSLT14 and bert-base-cased on WMT tasks. We use bert-base-german-cased for German and bert-base-multilingual-cased for all other languages. When extending to autoregressive decoding, we utilize bert-large-cased (n_layers = 24, n_heads = 16, d_hidden = 1024, d_FFN = 4096) for English to keep consistency with [41]. For adapters, on the encoder side, we set the hidden dimension between the two FFN layers to d_A^enc = 2048 for WMT tasks and 512 for IWSLT14 tasks. On the decoder side, the hidden dimension of the cross-attention module is set equal to the hidden dimension of the BERT models, i.e., d_A^dec = 768 for bert-base models and d_A^dec = 1024 for bert-large models. ... During inference, we generate multiple translation candidates by taking the top B length predictions into consideration, and select the translation with the highest probability as the final result. We set B = 4 for all tasks, and the upper bound of iterative decoding is set to 10. For autoregressive decoding, we use beam search with width 5 for all tasks. (Illustrative sketches of the adapter modules and the length-beam decoding described here are given below the table.) |
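
The adapter dimensions quoted in the Experiment Setup row describe a bottleneck feed-forward adapter on the encoder side and a cross-attention adapter on the decoder side. The following is a minimal PyTorch sketch of modules with those shapes; the class names, LayerNorm placement, and ReLU activation are illustrative assumptions and are not taken from the released abnet code.

```python
# Illustrative sketch (not the authors' released implementation) of the adapter
# shapes described in the Experiment Setup row, using standard PyTorch modules.
import torch
import torch.nn as nn


class EncoderAdapter(nn.Module):
    """Bottleneck adapter: two feed-forward layers with a residual connection.

    d_model is the BERT hidden size (768 for bert-base); d_adapter is the hidden
    dimension between the two FFN layers (2048 for WMT, 512 for IWSLT14).
    """

    def __init__(self, d_model: int, d_adapter: int):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_adapter),
            nn.ReLU(),
            nn.Linear(d_adapter, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(self.layer_norm(x))


class DecoderAdapter(nn.Module):
    """Cross-attention adapter attending to the source-side (encoder) states.

    The attention hidden dimension matches the BERT hidden size
    (768 for bert-base, 1024 for bert-large).
    """

    def __init__(self, d_model: int, n_heads: int = 12):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, encoder_out: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.cross_attn(self.layer_norm(x), encoder_out, encoder_out)
        return x + attn_out


# Example instantiation with the bert-base sizes reported for a WMT task.
enc_adapter = EncoderAdapter(d_model=768, d_adapter=2048)
dec_adapter = DecoderAdapter(d_model=768, n_heads=12)

src_states = torch.randn(2, 20, 768)   # source BERT states (batch, src_len, d)
tgt_states = torch.randn(2, 18, 768)   # target BERT states (batch, tgt_len, d)
print(enc_adapter(src_states).shape)              # torch.Size([2, 20, 768])
print(dec_adapter(tgt_states, src_states).shape)  # torch.Size([2, 18, 768])
```

The residual form keeps the pretrained BERT layers frozen while only the small adapter parameters are trained, which is the parameter-efficiency argument the paper makes for adapters.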
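The same row also describes the parallel decoding procedure: take the top B = 4 predicted target lengths, refine each candidate for up to 10 iterations, and keep the candidate with the highest probability. The toy, self-contained sketch below illustrates that control flow only; DummyNARModel and every method on it are invented placeholders, not the fairseq/abnet API, and the refinement step is simplified to a plain argmax rather than mask-predict re-masking.

```python
# Toy sketch of length-beam decoding for a non-autoregressive model:
# rank the top-B predicted lengths, refine each candidate, keep the best-scoring one.
import torch
import torch.nn as nn


class DummyNARModel(nn.Module):
    """Toy stand-in for a non-autoregressive translation model (not abnet)."""

    def __init__(self, vocab_size: int = 100, d_model: int = 32, max_len: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.length_head = nn.Linear(d_model, max_len)
        self.out = nn.Linear(d_model, vocab_size)
        self.mask_id = 0

    def encode(self, src):                    # (batch, src_len) -> (batch, d_model)
        return self.embed(src).mean(dim=1)

    def predict_length(self, enc):            # scores over candidate target lengths
        return self.length_head(enc)

    def decode(self, tgt, enc):               # (batch, tgt_len) -> token log-probs
        h = self.embed(tgt) + enc.unsqueeze(1)
        return torch.log_softmax(self.out(h), dim=-1)


def decode_with_length_beam(model, src, B=4, max_iters=10):
    """Pick top-B target lengths, refine each candidate, return the best-scoring one."""
    enc = model.encode(src)
    top_lengths = model.predict_length(enc).topk(B, dim=-1).indices[0]

    best_score, best_tokens = float("-inf"), None
    for length in top_lengths.tolist():
        length = max(length, 1)
        # Start from an all-mask target of the candidate length.
        tokens = torch.full((src.size(0), length), model.mask_id, dtype=torch.long)
        for _ in range(max_iters):             # iterative refinement (simplified)
            log_probs = model.decode(tokens, enc)
            tokens = log_probs.argmax(dim=-1)
        score = log_probs.max(dim=-1).values.mean().item()  # avg token log-prob
        if score > best_score:
            best_score, best_tokens = score, tokens
    return best_tokens


model = DummyNARModel()
src = torch.randint(1, 100, (1, 10))
print(decode_with_length_beam(model, src, B=4, max_iters=10).shape)
```

Because all B length candidates can be refined in parallel and each refinement pass updates every position at once, this is where the reported halving of inference latency relative to autoregressive beam search comes from.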