Fuzzy Alignments in Directed Acyclic Graph for Non-Autoregressive Machine Translation

Authors: Zhengrui Ma, Chenze Shao, Shangtong Gui, Min Zhang, Yang Feng

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on major WMT benchmarks show that our method substantially improves translation performance and increases prediction confidence, setting a new state of the art for NAT on the raw training data."
Researcher Affiliation | Academia | Zhengrui Ma (1,2), Chenze Shao (1,2), Shangtong Gui (1,2), Min Zhang (3) & Yang Feng (1,2); affiliations: (1) Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences; (3) Harbin Institute of Technology, Shenzhen
Pseudocode | Yes | Algorithm 1: Calculation of $\mathbb{E}_y[C_g(y)]$ and $\mathbb{E}_y[\sum_{g \in G_n(y)} C_g(y)]$ (see the illustrative sketch after the table)
Open Source Code | Yes | Source code: https://github.com/ictnlp/FA-DAT
Open Datasets | Yes | "We conduct experiments on two major benchmarks that are widely used in previous studies: WMT14 English-German (EN-DE, 4M) and WMT17 Chinese-English (ZH-EN, 20M). Newstest2013 as the validation set and newstest2014 as the test set for EN-DE; devtest2017 as the validation set and newstest2017 as the test set for ZH-EN."
Dataset Splits | Yes | "Newstest2013 as the validation set and newstest2014 as the test set for EN-DE; devtest2017 as the validation set and newstest2017 as the test set for ZH-EN."
Hardware Specification | Yes | "All the experiments are conducted on GeForce RTX 3090 GPUs."
Software Dependencies | No | The paper mentions implementing models with the "open-source toolkit fairseq (Ott et al., 2019)" but does not provide specific version numbers for fairseq or other software dependencies.
Experiment Setup | Yes | "During both pretraining and finetuning, we set dropout rate to 0.1, weight decay to 0.01, and no label smoothing is applied. In pretraining, all models are trained for 300k updates with a batch size of 64k tokens. The learning rate warms up to 5e-4 within 10k steps. In finetuning, we use the batch of 256k tokens to stabilize the gradients and train models for 5k updates. The learning rate warms up to 2e-4 within 500 steps." (summarized in the configuration sketch after the table)
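
The Algorithm 1 quoted in the Pseudocode row computes expected n-gram statistics over the decoding DAG. The snippet below is only a rough, illustrative sketch, not the authors' Algorithm 1: it shows how the expected count of a single n-gram, $\mathbb{E}_y[C_g(y)]$, could be computed by a log-space dynamic program over a DAG with per-vertex emission probabilities and upper-triangular transition probabilities. The tensor layout and the names `emit_logp` / `trans_logp` are assumptions made for this sketch; the second quantity, $\mathbb{E}_y[\sum_{g \in G_n(y)} C_g(y)]$, would be accumulated analogously over all n-grams of the reference.

```python
import torch

def expected_ngram_count(emit_logp, trans_logp, ngram):
    """Illustrative sketch, not the paper's exact Algorithm 1.

    emit_logp:  [L, V] tensor, log P(token | vertex)
    trans_logp: [L, L] tensor, log P(u -> v), upper-triangular (u < v)
    ngram:      list of n token ids
    Returns the expected number of occurrences of `ngram` along a random
    path through the DAG, assuming every path starts at vertex 0.
    """
    L = emit_logp.size(0)
    neg_inf = float("-inf")

    # alpha[v]: log marginal probability that a path visits vertex v.
    alpha = torch.full((L,), neg_inf)
    alpha[0] = 0.0
    for v in range(1, L):
        alpha[v] = torch.logsumexp(alpha[:v] + trans_logp[:v, v], dim=0)

    # beta[v]: log expected mass of partial n-gram matches whose last
    # matched token is emitted at vertex v.
    beta = alpha + emit_logp[:, ngram[0]]
    for tok in ngram[1:]:
        new_beta = torch.full((L,), neg_inf)
        for v in range(1, L):
            # Extend a partial match ending at some earlier vertex u < v
            # through a direct transition u -> v, emitting `tok` at v.
            new_beta[v] = torch.logsumexp(beta[:v] + trans_logp[:v, v], dim=0) \
                          + emit_logp[v, tok]
        beta = new_beta

    # Sum over all vertices where the n-gram can end; back to prob space.
    return torch.logsumexp(beta, dim=0).exp()
```

Under these assumptions, the expected count decomposes as a sum over vertex chains $v_1 < \dots < v_n$ of the visiting probability of $v_1$, the emission probabilities of the n-gram tokens, and the transition probabilities between consecutive vertices, which is what the two loops accumulate in log space.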
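
For quick reference, the quoted training setup can be collected into the following configuration sketch. The dictionary keys are our own shorthand, and the fairseq flag names in the comments refer to standard fairseq options assumed for illustration rather than taken from the FA-DAT repository's scripts.

```python
# Hyperparameters as quoted in the Experiment Setup row; the mapping to
# fairseq CLI flags (in comments) is an assumption based on standard
# fairseq options, not on the FA-DAT repository.
PRETRAIN = {
    "dropout": 0.1,            # --dropout
    "weight_decay": 0.01,      # --weight-decay
    "label_smoothing": 0.0,    # label smoothing disabled
    "max_update": 300_000,     # --max-update
    "batch_tokens": 64_000,    # effective tokens per update
    "peak_lr": 5e-4,           # --lr
    "warmup_updates": 10_000,  # --warmup-updates
}

FINETUNE = {
    "dropout": 0.1,
    "weight_decay": 0.01,
    "label_smoothing": 0.0,
    "max_update": 5_000,
    "batch_tokens": 256_000,   # larger batch to stabilize gradients
    "peak_lr": 2e-4,
    "warmup_updates": 500,
}
```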