Aligned Cross Entropy for Non-Autoregressive Machine Translation

Authors: Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, Omer Levy

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on machine translation benchmarks demonstrate that AXE substantially boosts the performance of CMLMs, while having the same decoding speed.
Researcher Affiliation | Industry | Facebook AI Research. Correspondence to: Marjan Ghazvininejad <ghazvini@fb.com>.
Pseudocode | Yes | Algorithm 1 Aligned Cross Entropy [a simplified alignment-DP sketch follows the table]
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We evaluate our method on both directions of three standard machine translation datasets with various training data sizes: WMT 14 English-German (4.5M sentence pairs), WMT 16 English-Romanian (610k pairs), and WMT 17 English-Chinese (20M pairs). The datasets are tokenized into subword units using BPE (Sennrich et al., 2016). We use the same data and preprocessing as Vaswani et al. (2017), Lee et al. (2018), and Wu et al. (2019) for WMT 14 EN-DE, WMT 16 EN-RO, and WMT 17 EN-ZH respectively.
Dataset Splits | Yes | We measure the validation loss at the end of each epoch, and average the 5 best checkpoints based on their validation loss to create the final model. ... Similarly we use ℓ = 5 length candidates for CMLM models, tune the length multiplier (λ ∈ {1.05 . . . 1.1}), and the target skipping penalty (δ ∈ {1 . . . 5}) on the validation set. [a checkpoint-averaging sketch follows the table]
Hardware Specification | Yes | We train all models with mixed precision floating point arithmetic on 16 Nvidia V100 GPUs. [a mixed-precision training sketch follows the table]
Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2015) for optimization and BERT (Devlin et al., 2018) for weight initialization, which are methods/models. It does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We generally follow the transformer base hyperparameters (Vaswani et al., 2017): 6 layers for the encoder and decoder, 8 attention heads per layer, 512 model dimensions, and 2048 hidden dimensions. ... For regularization, we set dropout to 0.3, and use 0.01 L2 weight decay and label smoothing with ε = 0.1. We train batches of 128k tokens using Adam (Kingma & Ba, 2015) with β = (0.9, 0.999) and ε = 10⁻⁶. The learning rate warms up to 5 × 10⁻⁴ within 10k steps, and then decays with the inverse square-root schedule. We train all models for 300k steps. [an optimizer and learning-rate-schedule sketch follows the table]
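
The Pseudocode row above refers to Algorithm 1 (Aligned Cross Entropy), which the paper specifies but this summary does not reproduce. Purely as an illustration of the idea, the sketch below computes a minimum cross entropy over monotonic alignments between target tokens and decoder predictions, using align, skip-prediction, and skip-target operations in a dynamic program. The particular costs used here (charging the blank-token probability to skip a prediction and a constant penalty delta to skip a target token) are assumptions of this sketch, not the paper's exact Algorithm 1.

```python
# Illustration only: a simplified monotonic-alignment cross entropy in the
# spirit of AXE. The blank token and the constant skip-target penalty delta
# are assumptions of this sketch and may differ from the paper's Algorithm 1.
import torch

def simple_aligned_xent(log_probs: torch.Tensor,
                        target: torch.Tensor,
                        blank_id: int,
                        delta: float = 1.0) -> float:
    """Minimum cross entropy over monotonic alignments (single sequence).

    log_probs: (m, vocab) log-probabilities at the m decoder positions.
    target:    (n,) target token ids.
    blank_id:  token charged when a prediction position is left unaligned.
    delta:     constant penalty charged when a target token is skipped.
    """
    m = log_probs.shape[0]
    n = target.shape[0]
    inf = float("inf")

    # A[k][i] = best cost of explaining the first k targets with the first i predictions.
    A = [[inf] * (m + 1) for _ in range(n + 1)]
    A[0][0] = 0.0
    for i in range(1, m + 1):                      # only skip-prediction moves possible here
        A[0][i] = A[0][i - 1] - log_probs[i - 1, blank_id].item()
    for k in range(1, n + 1):                      # only skip-target moves possible here
        A[k][0] = A[k - 1][0] + delta

    for k in range(1, n + 1):
        for i in range(1, m + 1):
            align = A[k - 1][i - 1] - log_probs[i - 1, target[k - 1]].item()
            skip_pred = A[k][i - 1] - log_probs[i - 1, blank_id].item()
            skip_tgt = A[k - 1][i] + delta
            A[k][i] = min(align, skip_pred, skip_tgt)

    # Pure-Python DP for clarity: no batching, no gradient tracking.
    return A[n][m]

# Tiny usage example with random predictions.
vocab, m, n = 8, 6, 4
lp = torch.log_softmax(torch.randn(m, vocab), dim=-1)
y = torch.randint(1, vocab, (n,))
print(simple_aligned_xent(lp, y, blank_id=0, delta=2.0))
```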
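
The Dataset Splits row quotes the model-selection recipe: average the 5 best checkpoints by validation loss. A minimal sketch of checkpoint averaging in PyTorch is given below; it assumes each checkpoint file stores a plain state_dict, and the file names are hypothetical placeholders.

```python
# Minimal checkpoint-averaging sketch. Assumes each checkpoint file stores a
# plain state_dict; the file names below are hypothetical placeholders.
import torch

def average_checkpoints(paths):
    """Element-wise average of parameter tensors across several checkpoints."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. the five checkpoints with the lowest validation loss
best_five = [f"checkpoint.best_loss_{i}.pt" for i in range(5)]
# model.load_state_dict(average_checkpoints(best_five))
```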
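
The Hardware Specification row notes mixed-precision training on 16 V100 GPUs, without stating the tooling. As one hedged illustration of the general pattern (not the paper's actual training code), here is a standard torch.cuda.amp update loop with a placeholder model and loss:

```python
# Hedged illustration of a mixed-precision update with torch.cuda.amp.
# The model, data, and loss are placeholders, not the paper's setup.
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scaler = torch.cuda.amp.GradScaler()

for step in range(3):
    x = torch.randn(8, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run the forward pass in a fp16/fp32 mix
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()            # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                   # unscales gradients, then steps
    scaler.update()
```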
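
The Experiment Setup row specifies Adam with β = (0.9, 0.999) and ε = 10⁻⁶, 0.01 weight decay, and a learning rate that warms up to 5 × 10⁻⁴ over 10k steps before decaying with the inverse square-root schedule. The sketch below wires these numbers into a plain PyTorch optimizer and LambdaLR schedule; the linear warmup shape and the toy model are assumptions of this sketch.

```python
# Sketch of the optimizer and learning-rate schedule from the Experiment Setup
# row: Adam(0.9, 0.999), eps=1e-6, 0.01 weight decay, warmup to 5e-4 over 10k
# steps, then inverse square-root decay. The linear warmup shape and the toy
# model are assumptions of this sketch.
import torch

peak_lr, warmup_steps = 5e-4, 10_000

def inverse_sqrt_lr(step: int) -> float:
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps        # assumed linear warmup
    return peak_lr * (warmup_steps / (step + 1)) ** 0.5   # ~1/sqrt(step) decay

model = torch.nn.Linear(512, 512)                         # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                             betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: inverse_sqrt_lr(step) / peak_lr)
# During training, call scheduler.step() after every optimizer update.

# Inspect the schedule at a few milestones, including the 300k-step endpoint.
for step in (0, 9_999, 10_000, 40_000, 300_000):
    print(step, round(inverse_sqrt_lr(step), 6))
```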