Non-Monotonic Latent Alignments for CTC-Based Non-Autoregressive Machine Translation
Authors: Chenze Shao, Yang Feng
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on major WMT benchmarks show that our method substantially improves the translation performance of CTC-based models. Our best model achieves 30.06 BLEU on WMT14 En-De with only one-iteration decoding, closing the gap between non-autoregressive and autoregressive models. |
| Researcher Affiliation | Academia | Chenze Shao (1,2), Yang Feng (1,2). (1) Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Source code: https://github.com/ictnlp/NMLA-NAT. |
| Open Datasets | Yes | We evaluate our methods on the most widely used public benchmarks in previous NAT studies: WMT14 English↔German (En↔De, 4.5M sentence pairs) [5] and WMT16 English↔Romanian (En↔Ro, 0.6M sentence pairs) [6]. |
| Dataset Splits | Yes | For WMT14 En↔De, the validation set is newstest2013 and the test set is newstest2014. For WMT16 En↔Ro, the validation set is newsdev-2016 and the test set is newstest-2016. |
| Hardware Specification | Yes | We use the GeForce RTX 3090 GPU to train models and measure the translation latency. |
| Software Dependencies | No | We implement our models based on the open-source framework of fairseq [MIT License, 31]. |
| Experiment Setup | Yes | On WMT14 En↔De, we use a dropout rate of 0.2 to train NAT models and use a dropout rate of 0.1 for finetuning. On WMT16 En↔Ro, the dropout rate is 0.3 for both the pretraining and finetuning. We use the batch size 64K and train NAT models for 300K steps on WMT14 En↔De and 150K steps on WMT16 En↔Ro. During the finetuning, we train NAT models for 6K steps with the batch size 256K. All models are optimized with Adam [27] with β = (0.9, 0.98) and ε = 10⁻⁸. The learning rate warms up to 5 × 10⁻⁴ within 10K steps in the pretraining and warms up to 10⁻⁴ within 500 steps in the finetuning. An illustrative sketch of this schedule appears below the table. |
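
To make the quoted training configuration concrete, here is a minimal sketch, assuming plain PyTorch and a placeholder model, of the Adam settings and learning-rate warmup described in the Experiment Setup row. It is not the authors' implementation (their released code builds on fairseq); the linear-warmup form, the omission of a post-warmup decay, and the finetuning peak rate of 1e-4 (the source text is garbled at that point) are assumptions.

```python
# Minimal sketch (assumption: plain PyTorch, placeholder model) of the optimizer
# and warmup settings quoted in the Experiment Setup row. The authors' released
# code builds on fairseq; this only illustrates the reported hyperparameters.
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the CTC-based NAT model

# Adam with betas = (0.9, 0.98) and eps = 1e-8, as reported in the paper.
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-8)

def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    # Linear warmup to the peak rate; the post-warmup decay is not specified
    # in the quoted text, so it is left out of this sketch.
    return peak_lr * min(1.0, step / warmup_steps)

# Pretraining on WMT14 En-De: peak LR 5e-4 within 10K steps, 300K updates,
# dropout 0.2 (WMT16 En-Ro: 150K updates, dropout 0.3).
PRETRAIN = {"peak_lr": 5e-4, "warmup_steps": 10_000, "max_steps": 300_000}
# Finetuning: 6K updates, 500 warmup steps, dropout 0.1 on WMT14 En-De;
# the peak rate is read as 1e-4 from a garbled value in the source text.
FINETUNE = {"peak_lr": 1e-4, "warmup_steps": 500, "max_steps": 6_000}

for step in range(1, PRETRAIN["max_steps"] + 1):
    for group in optimizer.param_groups:
        group["lr"] = warmup_lr(step, PRETRAIN["peak_lr"], PRETRAIN["warmup_steps"])
    # ... forward pass, CTC-based latent-alignment loss, backward, optimizer.step() ...
    if step == 3:
        break  # stop early; this sketch only demonstrates the schedule
```

Anyone reproducing the results should rely on the released fairseq-based code at https://github.com/ictnlp/NMLA-NAT rather than this sketch.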