Order-Agnostic Cross Entropy for Non-Autoregressive Machine Translation

Authors: Cunxiao Du, Zhaopeng Tu, Jing Jiang

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on major WMT benchmarks show that OAXE substantially improves translation performance, setting new state of the art for fully NAT models.
Researcher Affiliation | Collaboration | 1 School of Computing and Information Systems, Singapore Management University, Singapore. Work was done when Cunxiao Du was under the Rhino-Bird Elite Training Program of Tencent AI Lab. 2 Tencent AI Lab, China.
Pseudocode | Yes | We use the Hungarian algorithm to efficiently implement OAXE (e.g., 7 lines of core code; see Appendix A.1). A sketch of such a matching-based loss is given after this table.
Open Source Code | Yes | Our code, data, and trained models are available at https://github.com/tencent-ailab/ICML21_OAXE.
Open Datasets | Yes | We conducted experiments on major benchmarking datasets that are widely used in previous NAT studies (Gu et al., 2018; Shao et al., 2020; Ma et al., 2019; Saharia et al., 2020): WMT14 English↔German (En↔De, 4.5M sentence pairs) and WMT16 English↔Romanian (En↔Ro, 0.6M sentence pairs). ... We use the dataset released by Ott et al. (2018) for evaluating translation uncertainty, which consists of ten human translations for 500 sentences taken from the WMT14 En-De test set.
Dataset Splits | Yes | The training set consists of 300K instances, in which the target is an ordering sampled from a given set of ordering modes according to a categorical distribution. Both the validation and test sets consist of 3K instances, and all the ordering modes serve as references for the test sets. A sketch of this data construction is given after this table.
Hardware Specification | No | The paper mentions that the Hungarian matching was implemented with a CPU-only Python package and that training is 1.36 times slower, but it does not provide specific details about the CPU or any other hardware, such as GPU models, memory, or cloud instance types, used for the experiments.
Software Dependencies | No | The paper mentions software such as the Python package scipy, PyTorch, Adam, and Fairseq, but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We trained batches of approximately 128K tokens using Adam (Kingma & Ba, 2015). The learning rate warmed up to 5 × 10⁻⁴ in the first 10K steps and then decayed with the inverse square-root schedule. We trained all models for 300K steps, measured the validation BLEU at the end of each epoch, and averaged the 5 best checkpoints. A sketch of this learning-rate schedule is given after this table.
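
The Pseudocode row reports that OAXE is implemented with the Hungarian algorithm in about 7 lines of core code (Appendix A.1 of the paper), using the CPU version of the scipy package. The sketch below is not the authors' code; it is a minimal, per-sentence illustration of how such a loss could be assembled with scipy.optimize.linear_sum_assignment, where the helper name oaxe_loss and the tensor shapes are assumptions.

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def oaxe_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Order-agnostic cross entropy for a single sentence (hypothetical helper).

    logits:  (T, V) unnormalized scores for T output positions over a vocabulary of size V
    targets: (T,)   reference token ids; their order is ignored by the loss
    """
    log_probs = F.log_softmax(logits, dim=-1)              # (T, V)
    # cost[i, j] = -log p(output position i emits reference token j)
    cost = -log_probs[:, targets].detach().cpu().numpy()   # (T, T)
    # Hungarian matching: lowest-cost one-to-one alignment between
    # output positions and reference tokens (runs on CPU via scipy).
    row_ind, col_ind = linear_sum_assignment(cost)
    row_ind = torch.as_tensor(row_ind, device=logits.device)
    col_ind = torch.as_tensor(col_ind, device=logits.device)
    # Cross entropy of the reference tokens under their best ordering.
    return -log_probs[row_ind, targets[col_ind]].mean()

For batched training one would additionally mask padding positions and either loop over sentences or vectorize the cost construction; the authors' actual implementation is the one referenced in Appendix A.1 of the paper.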
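The Dataset Splits row describes synthetic training targets that are orderings sampled from a fixed set of ordering modes under a categorical distribution. The snippet below is only a hedged illustration of that kind of construction; ORDERING_MODES, MODE_PROBS, and make_instance are hypothetical and are not taken from the paper's data pipeline.

import random

# Hypothetical ordering modes: each mode is a permutation of target positions.
ORDERING_MODES = [
    [0, 1, 2, 3],   # left-to-right
    [3, 2, 1, 0],   # right-to-left
    [1, 0, 3, 2],   # swapped pairs
]
MODE_PROBS = [0.5, 0.3, 0.2]  # assumed categorical distribution over the modes


def make_instance(reference_tokens):
    """Sample one ordering mode and reorder the reference accordingly."""
    mode = random.choices(ORDERING_MODES, weights=MODE_PROBS, k=1)[0]
    return [reference_tokens[i] for i in mode]


# One synthetic training target for the reference ["A", "B", "C", "D"];
# at test time, all orderings produced by ORDERING_MODES count as references.
print(make_instance(["A", "B", "C", "D"]))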
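The Experiment Setup row specifies a warm-up to 5 × 10⁻⁴ over the first 10K steps followed by inverse square-root decay. The sketch below shows one way to express that schedule; the helper name inverse_sqrt_lr and the commented LambdaLR wiring with a base learning rate of 1.0 are assumptions, not the authors' Fairseq training script.

PEAK_LR = 5e-4          # learning rate reached at the end of warm-up (from the paper)
WARMUP_STEPS = 10_000   # warm-up duration reported in the paper


def inverse_sqrt_lr(step: int) -> float:
    """Linear warm-up to PEAK_LR, then inverse square-root decay."""
    step = max(step, 1)
    if step <= WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (WARMUP_STEPS ** 0.5) / (step ** 0.5)


print(inverse_sqrt_lr(10_000))  # 0.0005 at the end of warm-up
print(inverse_sqrt_lr(40_000))  # 0.00025 after inverse square-root decay

# Hypothetical wiring into an optimizer (assumes PyTorch and a model are available):
# with a base lr of 1.0, LambdaLR treats the returned value as the absolute learning rate.
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt_lr)

At step 10,000 the schedule returns the peak rate of 5e-4, and by step 40,000 it has decayed to 2.5e-4, matching the inverse square-root shape described in the setup.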