Aligned Cross Entropy for Non-Autoregressive Machine Translation
Authors: Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, Omer Levy
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on machine translation benchmarks demonstrate that AXE substantially boosts the performance of CMLMs, while having the same decoding speed. |
| Researcher Affiliation | Industry | Facebook AI Research. Correspondence to: Marjan Ghazvininejad <ghazvini@fb.com>. |
| Pseudocode | Yes | Algorithm 1 Aligned Cross Entropy (see the illustrative sketch after this table) |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | We evaluate our method on both directions of three standard machine translation datasets with various training data sizes: WMT 14 English-German (4.5M sentence pairs), WMT 16 English-Romanian (610k pairs), and WMT 17 English-Chinese (20M pairs). The datasets are tokenized into subword units using BPE (Sennrich et al., 2016). We use the same data and preprocessing as Vaswani et al. (2017), Lee et al. (2018), and Wu et al. (2019) for WMT 14 EN-DE, WMT 16 EN-RO, and WMT 17 EN-ZH respectively. |
| Dataset Splits | Yes | We measure the validation loss at the end of each epoch, and average the 5 best checkpoints based on their validation loss to create the final model. ... Similarly we use ℓ = 5 length candidates for CMLM models, tune the length multiplier (λ ∈ {1.05 … 1.1}), and the target skipping penalty (δ ∈ {1 … 5}) on the validation set. |
| Hardware Specification | Yes | We train all models with mixed precision floating point arithmetic on 16 Nvidia V100 GPUs. |
| Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2015) for optimization and BERT (Devlin et al., 2018) for weight initialization, which are methods/models. It does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We generally follow the transformer base hyperparameters (Vaswani et al., 2017): 6 layers for the encoder and decoder, 8 attention heads per layer, 512 model dimensions, and 2048 hidden dimensions. ... For regularization, we set dropout to 0.3, and use 0.01 L2 weight decay and label smoothing with ε = 0.1. We train batches of 128k tokens using Adam (Kingma & Ba, 2015) with β = (0.9, 0.999) and ε = 10⁻⁶. The learning rate warms up to 5 × 10⁻⁴ within 10k steps, and then decays with the inverse square-root schedule. We train all models for 300k steps. (See the optimizer sketch after this table.) |
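
The Pseudocode row above points to Algorithm 1 (Aligned Cross Entropy), a dynamic program over monotonic alignments between target tokens and decoder predictions. The sketch below is a minimal illustration of that style of loss, not a reproduction of the paper's exact recursion: the three operations (align, skip prediction, skip target) follow the paper's terminology, but the precise cost terms, the role of the blank token, and how the penalty δ enters are assumptions here.

```python
import numpy as np

def axe_style_cost(log_probs, target, blank_id, delta=1.0):
    """Monotonic-alignment cross entropy via dynamic programming
    (illustrative sketch in the spirit of AXE's Algorithm 1).

    log_probs : (T_pred, V) array of log-probabilities, one row per decoder position.
    target    : list of gold token ids, length T_tgt.
    blank_id  : id of a special "empty" token charged when a prediction is skipped
                (assumed stand-in for the paper's epsilon token).
    delta     : penalty factor for skipping a target token (assumed interpretation
                of the target-skip penalty).
    """
    T_pred = log_probs.shape[0]
    T_tgt = len(target)
    INF = float("inf")

    # A[i][j] = minimal cost of explaining the first i target tokens
    # with the first j predictions.
    A = [[INF] * (T_pred + 1) for _ in range(T_tgt + 1)]
    A[0][0] = 0.0

    for i in range(T_tgt + 1):
        for j in range(T_pred + 1):
            candidates = []
            if i > 0 and j > 0:
                # align: pair target token i with prediction j
                candidates.append(A[i - 1][j - 1] - log_probs[j - 1, target[i - 1]])
                # skip target: stay on prediction j and pay a delta-scaled cost
                candidates.append(A[i - 1][j] - delta * log_probs[j - 1, target[i - 1]])
            if j > 0:
                # skip prediction: charge prediction j against the blank token
                candidates.append(A[i][j - 1] - log_probs[j - 1, blank_id])
            if candidates:
                A[i][j] = min(A[i][j], *candidates)
    return A[T_tgt][T_pred]
```

For training, one would additionally keep back-pointers through the table and sum the cross-entropy terms along the selected alignment, so the loss stays differentiable with respect to the model's log-probabilities; the O(T_tgt × T_pred) table can also be computed in a batched fashion on GPU.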
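The numbers in the Experiment Setup row map onto a fairly standard optimizer configuration. The sketch below wires them into PyTorch as an illustration only: the linear shape of the warmup, the use of `LambdaLR`, and the placeholder model are assumptions; only the learning rate, warmup length, betas, epsilon, and weight decay come from the row above.

```python
import math
import torch

# Hypothetical stand-in module; the actual setup trains a 6-layer
# encoder-decoder transformer (512 model dim, 2048 hidden dim).
model = torch.nn.Linear(512, 512)

# Adam with the stated betas, epsilon, and 0.01 L2 weight decay.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-4,                 # peak learning rate
    betas=(0.9, 0.999),
    eps=1e-6,
    weight_decay=0.01,
)

WARMUP_STEPS = 10_000

def inverse_sqrt(step: int) -> float:
    """Warm up to the peak LR over 10k steps (linear warmup assumed),
    then decay with the inverse square root of the step number."""
    step = max(step, 1)
    if step <= WARMUP_STEPS:
        return step / WARMUP_STEPS
    return math.sqrt(WARMUP_STEPS / step)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)

# Example: effective learning rate at a few points over the 300k training steps.
for step in (1_000, 10_000, 40_000, 300_000):
    print(step, 5e-4 * inverse_sqrt(step))
```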