Non-autoregressive Machine Translation with Disentangled Context Transformer
Authors: Jungo Kasai, James Cross, Marjan Ghazvininejad, Jiatao Gu
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on 7 translation directions with varying data sizes demonstrate that our model achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average. |
| Researcher Affiliation | Collaboration | Jungo Kasai¹, James Cross², Marjan Ghazvininejad², Jiatao Gu². ¹Paul G. Allen School of Computer Science & Engineering, University of Washington; work done at Facebook AI. ²Facebook AI. Correspondence to: Jungo Kasai <jkasai@cs.washington.edu>. |
| Pseudocode | Yes | Algorithm 1 Parallel Easy-First with Length Beam (sketched in Python below the table) |
| Open Source Code | Yes | Our code is available at https://github.com/facebookresearch/DisCo. |
| Open Datasets | Yes | We evaluate on 7 directions from four standard datasets with various training data sizes: WMT14 EN-DE (4.5M pairs), WMT16 EN-RO (610K pairs), WMT17 EN-ZH (20M pairs), and WMT14 EN-FR (36M pairs, en→fr only). These datasets are all encoded into subword units by BPE (Sennrich et al., 2016). We use the same preprocessed data and train/dev/test splits as prior work for fair comparisons (EN-DE: Vaswani et al., 2017; EN-RO: Lee et al., 2018; EN-ZH: Hassan et al., 2018; Wu et al., 2019; EN-FR: Gehring et al., 2017; Ott et al., 2018). |
| Dataset Splits | Yes | We use the same preprocessed data and train/dev/test splits as prior work for fair comparisons (EN-DE: Vaswani et al., 2017; EN-RO: Lee et al., 2018; EN-ZH: Hassan et al., 2018; Wu et al., 2019; EN-FR: Gehring et al., 2017; Ott et al., 2018). |
| Hardware Specification | Yes | We use 16 Tesla V100 GPUs and accelerate training by mixed precision floating point (Micikevicius et al., 2018), and implement all models with fairseq (Ott et al., 2019). ... All models are implemented in fairseq (Ott et al., 2019) and run on a single Nvidia V100 GPU. |
| Software Dependencies | Yes | We implement all models with fairseq (Ott et al., 2019). |
| Experiment Setup | Yes | Hyperparameters: We generally follow the hyperparameters for a transformer base (Vaswani et al., 2017; Ghazvininejad et al., 2019): 6 layers for both the encoder and decoder, 8 attention heads, 512 model dimensions, and 2048 hidden dimensions. We sample weights from N(0, 0.02), initialize biases to zero, and set layer normalization parameters to β = 0, γ = 1 (Devlin et al., 2019). For regularization, we tune the dropout rate from [0.1, 0.2, 0.3] based on dev performance in each direction, and apply weight decay with 0.01 and label smoothing with ε = 0.1. We train batches of approximately 128K tokens using Adam (Kingma & Ba, 2015) with β = (0.9, 0.999) and ε = 10⁻⁶. The learning rate warms up to 5 × 10⁻⁴ in the first 10K steps, and then decays with the inverse square-root schedule. We train all models for 300K steps apart from en→fr, where we train for 500K steps to account for the data size. (The warmup/decay schedule is sketched below the table.) |
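
The Pseudocode row above quotes Algorithm 1 (parallel easy-first inference with a length beam). The sketch below illustrates only the control flow described in the paper: every target position is first predicted from the source alone, then repeatedly re-predicted conditioned on the positions that were predicted with higher confidence in the previous pass, while a small beam of candidate lengths is rescored at the end. The `score_fn` interface, the placeholder token ids, and the convergence test are assumptions made here for illustration; this is not the authors' implementation.

```python
import numpy as np

def parallel_easy_first_decode(score_fn, src, length_candidates, max_iters=10):
    """Sketch of parallel easy-first inference with a length beam.

    Assumed interface (not from the paper's code): score_fn(src, tokens, visible)
    returns (log_probs, predictions) for every target position, where position i
    may only condition on the positions marked True in visible[i].
    """
    candidates = []
    for tgt_len in length_candidates:
        # First pass: every position conditions on the source only.
        visible = np.zeros((tgt_len, tgt_len), dtype=bool)
        tokens = np.full(tgt_len, -1, dtype=np.int64)  # placeholder ids, never read
        log_probs, tokens = score_fn(src, tokens, visible)

        for _ in range(max_iters - 1):
            # Easy-first ordering: positions predicted with higher confidence
            # in the previous pass are treated as "easier".
            order = np.argsort(-log_probs)            # most confident first
            rank = np.empty(tgt_len, dtype=np.int64)
            rank[order] = np.arange(tgt_len)
            # Position i now conditions on every position easier than itself.
            visible = rank[None, :] < rank[:, None]
            new_log_probs, new_tokens = score_fn(src, tokens, visible)
            if np.array_equal(new_tokens, tokens):    # predictions converged
                break
            tokens, log_probs = new_tokens, new_log_probs

        candidates.append((log_probs.mean(), tokens))

    # Length beam: keep the hypothesis with the highest average log-probability.
    return max(candidates, key=lambda c: c[0])[1]
```

In the actual DisCo model the per-position visibility is realized through disentangled attention contexts inside a single forward pass; the explicit boolean mask here is only meant to make the easy-first ordering concrete.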
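
The Experiment Setup row describes a warmup to 5 × 10⁻⁴ over the first 10K updates followed by inverse square-root decay. The helper below is a minimal sketch of that schedule; the linear-from-zero warmup shape and the function name are assumptions mirroring fairseq's common `inverse_sqrt` scheduler rather than the authors' exact code.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=10_000):
    """Warm up to peak_lr, then decay proportionally to 1/sqrt(step).

    Assumption: warmup is linear from zero (as in fairseq's `inverse_sqrt`
    scheduler); the paper only states the peak value and the decay rule.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear warmup
    return peak_lr * (warmup_steps / step) ** 0.5  # inverse square-root decay
```

For example, this sketch returns 5 × 10⁻⁴ at step 10,000 and 2.5 × 10⁻⁴ at step 40,000, since the post-warmup rate falls off as 1/√step.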