Duplex Sequence-to-Sequence Learning for Reversible Machine Translation

Authors: Zaixiang Zheng, Hao Zhou, Shujian Huang, Jiajun Chen, Jingjing Xu, Lei Li

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on standard machine translation benchmarks to inspect REDER's performance on seq2seq tasks. Experimental results show that the duplex idea indeed works: overall, REDER achieves BLEU scores of 27.50 and 31.25 on the standard WMT14 EN-DE and DE-EN benchmarks, respectively.
Researcher Affiliation | Collaboration | Zaixiang Zheng (1,2), Hao Zhou (2), Shujian Huang (1), Jiajun Chen (1), Jingjing Xu (2), Lei Li (3); 1) National Key Laboratory for Novel Software Technology, Nanjing University; 2) ByteDance AI Lab; 3) UC Santa Barbara.
Pseudocode | No | The paper describes the model architecture and computations using mathematical formulas and diagrams (e.g., Figure 2 and the equations for its reversible layers) but does not include any explicit pseudocode or algorithm blocks; a reversible-layer sketch is given after this table.
Open Source Code | Yes | Code is available at https://github.com/zhengzx-nlp/REDER.
Open Datasets | Yes | We evaluate our proposal on two standard translation benchmarks, i.e., WMT14 English (EN)↔German (DE) (4.5M training pairs) and WMT16 English (EN)↔Romanian (RO) (610K training pairs).
Dataset Splits | No | The paper states 'We measure the validation BLEU scores for every 2,000 updates, and average the best 5 checkpoints to obtain the final model,' indicating that a validation set was used, but it does not give the size or split percentages of this validation set relative to the main datasets; a checkpoint-averaging sketch follows this table.
Hardware Specification | Yes | All models are trained for 300K updates using Nvidia V100 GPUs with a batch size of approximately 64K tokens. We train REDER on WMT14 EN↔DE using 8 32GB V100 GPUs for 432 GPU hours (54 hours per GPU) and obtain a bidirectional translation model.
Software Dependencies | No | The paper states 'All models are implemented on fairseq [Ott et al., 2019]' and mentions 'an efficient library of C++ implementation' for CTC beam search, but it does not provide version numbers for these or for other software dependencies such as Python or PyTorch; a minimal sketch of the CTC collapsing rule appears after this table.
Experiment Setup | Yes | We design REDER based on the hyper-parameters of Transformer-base [Vaswani et al., 2017]. All models are implemented on fairseq [Ott et al., 2019]. REDER consists of 12 stacked layers. The number of attention heads is 8, the model dimension is 512, and the inner dimension of the FFN is 2048. For both AT and NAT models, we set the dropout rate to 0.1 for WMT14 EN↔DE and WMT16 EN↔RO. We adopt weight decay with a decay rate of 0.01 and label smoothing with ϵ = 0.1. By default, we upsample the source input by a factor of 2 for CTC-based models. We set λ_fba and λ_cc to 0.1 for all experiments. All models are trained for 300K updates using Nvidia V100 GPUs with a batch size of approximately 64K tokens. A configuration sketch collecting these hyper-parameters is given below.
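
Since the paper presents its reversible layers only through equations and Figure 2 (see the Pseudocode row), the following is a minimal sketch of a generic RevNet-style additive-coupling layer, the kind of building block REDER's reversible network is based on. The class name ReversibleCouplingLayer and the toy f/g sub-modules are illustrative assumptions, not the authors' implementation; in the actual model the sub-layers would be Transformer attention and feed-forward blocks.

```python
import torch
import torch.nn as nn


class ReversibleCouplingLayer(nn.Module):
    """Additive-coupling (RevNet-style) layer: the input is split into two
    halves (x1, x2), and each input half can be reconstructed exactly from
    the outputs, so intermediate activations need not be stored."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f  # first sub-layer (e.g. self-attention; assumed here)
        self.g = g  # second sub-layer (e.g. feed-forward; assumed here)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        # y1 = x1 + F(x2);  y2 = x2 + G(y1)
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        # Exact inversion: x2 = y2 - G(y1);  x1 = y1 - F(x2)
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2


if __name__ == "__main__":
    d = 256  # toy dimension for the round-trip check
    layer = ReversibleCouplingLayer(
        f=nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d)),
        g=nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d)),
    )
    x1, x2 = torch.randn(8, 10, d), torch.randn(8, 10, d)
    with torch.no_grad():
        y1, y2 = layer(x1, x2)
        r1, r2 = layer.inverse(y1, y2)
    print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))
```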
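The Dataset Splits row quotes the paper's procedure of averaging the best 5 checkpoints by validation BLEU. The snippet below is a minimal sketch of such parameter averaging; it assumes plain PyTorch state dicts (fairseq checkpoints actually nest the parameters under a 'model' key, and fairseq ships its own averaging utility), and the file names in the usage comment are hypothetical.

```python
import torch


def average_checkpoints(paths):
    """Average the parameter tensors of several checkpoints (e.g. the 5
    best by validation BLEU) into a single state dict."""
    avg_state = None
    for path in paths:
        # Assumed layout: the file holds a plain state dict of tensors.
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    for k in avg_state:
        avg_state[k] /= len(paths)
    return avg_state


# Hypothetical usage with the 5 best checkpoints by validation BLEU:
# best5 = ["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt", "ckpt_d.pt", "ckpt_e.pt"]
# torch.save(average_checkpoints(best5), "averaged.pt")
```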
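The Software Dependencies row mentions an efficient C++ library for CTC beam search, and the Experiment Setup row notes that the source is upsampled by a factor of 2 for CTC-based models. For reference, the function below sketches only the CTC collapsing rule (merge consecutive repeats, then drop blanks) applied greedily; the blank_id convention and the example IDs are assumptions, and this is not the beam-search decoder the authors used.

```python
def ctc_greedy_collapse(token_ids, blank_id=0):
    """Greedy CTC post-processing: merge consecutive repeated tokens,
    then remove blanks. blank_id=0 is an assumed convention."""
    out = []
    prev = None
    for t in token_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out


# With a 2x-upsampled source of length 2n, the model emits 2n tokens,
# which collapse to a target of length at most 2n:
print(ctc_greedy_collapse([5, 5, 0, 7, 7, 7, 0, 0, 9]))  # -> [5, 7, 9]
```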
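Finally, a hedged configuration sketch that simply collects the hyper-parameters reported in the Experiment Setup row in one place. The dictionary keys are illustrative, loosely modeled on fairseq-style option names; the paper's actual training command, architecture name, and flag names are not given in this section.

```python
# Hedged summary of the reported REDER hyper-parameters (key names assumed).
reder_config = {
    "layers": 12,              # 12 stacked (reversible) layers
    "attention_heads": 8,
    "model_dim": 512,          # Transformer-base size
    "ffn_dim": 2048,
    "dropout": 0.1,            # WMT14 EN<->DE and WMT16 EN<->RO
    "weight_decay": 0.01,
    "label_smoothing": 0.1,
    "upsample_ratio": 2,       # source upsampling for CTC-based models
    "lambda_fba": 0.1,         # auxiliary loss weight reported in the paper
    "lambda_cc": 0.1,          # auxiliary loss weight reported in the paper
    "max_update": 300_000,
    "batch_tokens": 64_000,    # ~64K tokens per update
    "hardware": "Nvidia V100 GPUs",
}
```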