Aligned Cross Entropy for Non-Autoregressive Machine Translation

Authors: Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, Omer Levy

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on machine translation benchmarks demonstrate that AXE substantially boosts the performance of CMLMs, while having the same decoding speed.
Researcher Affiliation | Industry | Facebook AI Research. Correspondence to: Marjan Ghazvininejad <ghazvini@fb.com>.
Pseudocode | Yes | Algorithm 1 Aligned Cross Entropy [a simplified alignment-DP sketch follows the table]
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We evaluate our method on both directions of three standard machine translation datasets with various training data sizes: WMT 14 English-German (4.5M sentence pairs), WMT 16 English-Romanian (610k pairs), and WMT 17 English-Chinese (20M pairs). The datasets are tokenized into subword units using BPE (Sennrich et al., 2016). We use the same data and preprocessing as Vaswani et al. (2017), Lee et al. (2018), and Wu et al. (2019) for WMT 14 EN-DE, WMT 16 EN-RO, and WMT 17 EN-ZH respectively.
Dataset Splits | Yes | We measure the validation loss at the end of each epoch, and average the 5 best checkpoints based on their validation loss to create the final model. ... Similarly we use ℓ = 5 length candidates for CMLM models, tune the length multiplier (λ ∈ {1.05 . . . 1.1}), and the target skipping penalty (δ ∈ {1 . . . 5}) on the validation set. [a checkpoint-averaging sketch follows the table]
Hardware Specification | Yes | We train all models with mixed precision floating point arithmetic on 16 Nvidia V100 GPUs. [a mixed-precision training sketch follows the table]
Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2015) for optimization and BERT (Devlin et al., 2018) for weight initialization, which are methods/models. It does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We generally follow the transformer base hyperparameters (Vaswani et al., 2017): 6 layers for the encoder and decoder, 8 attention heads per layer, 512 model dimensions, and 2048 hidden dimensions. ... For regularization, we set dropout to 0.3, and use 0.01 L2 weight decay and label smoothing with ε = 0.1. We train batches of 128k tokens using Adam (Kingma & Ba, 2015) with β = (0.9, 0.999) and ε = 10⁻⁶. The learning rate warms up to 5 × 10⁻⁴ within 10k steps, and then decays with the inverse square-root schedule. We train all models for 300k steps. [an optimizer and learning-rate-schedule sketch follows the table]
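
The Pseudocode row above refers to Algorithm 1 (Aligned Cross Entropy), which the paper specifies but this summary does not reproduce. Purely as an illustration of the idea, the sketch below computes a minimum cross entropy over monotonic alignments between target tokens and decoder predictions, using align, skip-prediction, and skip-target operations in a dynamic program. The particular costs used here (charging the blank-token probability to skip a prediction and a constant penalty delta to skip a target token) are assumptions of this sketch, not the paper's exact Algorithm 1.

```python
# Illustration only: a simplified monotonic-alignment cross entropy in the
# spirit of AXE. The blank token and the constant skip-target penalty delta
# are assumptions of this sketch and may differ from the paper's Algorithm 1.
import torch

def simple_aligned_xent(log_probs: torch.Tensor,
                        target: torch.Tensor,
                        blank_id: int,
                        delta: float = 1.0) -> float:
    """Minimum cross entropy over monotonic alignments (single sequence).

    log_probs: (m, vocab) log-probabilities at the m decoder positions.
    target:    (n,) target token ids.
    blank_id:  token charged when a prediction position is left unaligned.
    delta:     constant penalty charged when a target token is skipped.
    """
    m = log_probs.shape[0]
    n = target.shape[0]
    inf = float("inf")

    # A[k][i] = best cost of explaining the first k targets with the first i predictions.
    A = [[inf] * (m + 1) for _ in range(n + 1)]
    A[0][0] = 0.0
    for i in range(1, m + 1):                      # only skip-prediction moves possible here
        A[0][i] = A[0][i - 1] - log_probs[i - 1, blank_id].item()
    for k in range(1, n + 1):                      # only skip-target moves possible here
        A[k][0] = A[k - 1][0] + delta

    for k in range(1, n + 1):
        for i in range(1, m + 1):
            align = A[k - 1][i - 1] - log_probs[i - 1, target[k - 1]].item()
            skip_pred = A[k][i - 1] - log_probs[i - 1, blank_id].item()
            skip_tgt = A[k - 1][i] + delta
            A[k][i] = min(align, skip_pred, skip_tgt)

    # Pure-Python DP for clarity: no batching, no gradient tracking.
    return A[n][m]

# Tiny usage example with random predictions.
vocab, m, n = 8, 6, 4
lp = torch.log_softmax(torch.randn(m, vocab), dim=-1)
y = torch.randint(1, vocab, (n,))
print(simple_aligned_xent(lp, y, blank_id=0, delta=2.0))
```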
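
The Dataset Splits row quotes the model-selection recipe: average the 5 best checkpoints by validation loss. A minimal sketch of checkpoint averaging in PyTorch is given below; it assumes each checkpoint file stores a plain state_dict, and the file names are hypothetical placeholders.

```python
# Minimal checkpoint-averaging sketch. Assumes each checkpoint file stores a
# plain state_dict; the file names below are hypothetical placeholders.
import torch

def average_checkpoints(paths):
    """Element-wise average of parameter tensors across several checkpoints."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. the five checkpoints with the lowest validation loss
best_five = [f"checkpoint.best_loss_{i}.pt" for i in range(5)]
# model.load_state_dict(average_checkpoints(best_five))
```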
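
The Hardware Specification row notes mixed-precision training on 16 V100 GPUs, without stating the tooling. As one hedged illustration of the general pattern (not the paper's actual training code), here is a standard torch.cuda.amp update loop with a placeholder model and loss:

```python
# Hedged illustration of a mixed-precision update with torch.cuda.amp.
# The model, data, and loss are placeholders, not the paper's setup.
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scaler = torch.cuda.amp.GradScaler()

for step in range(3):
    x = torch.randn(8, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run the forward pass in a fp16/fp32 mix
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()            # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                   # unscales gradients, then steps
    scaler.update()
```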
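
The Experiment Setup row specifies Adam with β = (0.9, 0.999) and ε = 10⁻⁶, 0.01 weight decay, and a learning rate that warms up to 5 × 10⁻⁴ over 10k steps before decaying with the inverse square-root schedule. The sketch below wires these numbers into a plain PyTorch optimizer and LambdaLR schedule; the linear warmup shape and the toy model are assumptions of this sketch.

```python
# Sketch of the optimizer and learning-rate schedule from the Experiment Setup
# row: Adam(0.9, 0.999), eps=1e-6, 0.01 weight decay, warmup to 5e-4 over 10k
# steps, then inverse square-root decay. The linear warmup shape and the toy
# model are assumptions of this sketch.
import torch

peak_lr, warmup_steps = 5e-4, 10_000

def inverse_sqrt_lr(step: int) -> float:
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps        # assumed linear warmup
    return peak_lr * (warmup_steps / (step + 1)) ** 0.5   # ~1/sqrt(step) decay

model = torch.nn.Linear(512, 512)                         # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                             betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: inverse_sqrt_lr(step) / peak_lr)
# During training, call scheduler.step() after every optimizer update.

# Inspect the schedule at a few milestones, including the 300k-step endpoint.
for step in (0, 9_999, 10_000, 40_000, 300_000):
    print(step, round(inverse_sqrt_lr(step), 6))
```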