Non-autoregressive Machine Translation with Disentangled Context Transformer
Authors: Jungo Kasai, James Cross, Marjan Ghazvininejad, Jiatao Gu
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on 7 translation directions with varying data sizes demonstrate that our model achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average. |
| Researcher Affiliation | Collaboration | Jungo Kasai¹, James Cross², Marjan Ghazvininejad², Jiatao Gu². ¹Paul G. Allen School of Computer Science & Engineering, University of Washington; work done at Facebook AI. ²Facebook AI. Correspondence to: Jungo Kasai <jkasai@cs.washington.edu>. |
| Pseudocode | Yes | Algorithm 1 Parallel Easy-First with Length Beam (sketched in Python below the table) |
| Open Source Code | Yes | Our code is available at https://github.com/facebookresearch/DisCo. |
| Open Datasets | Yes | We evaluate on 7 directions from four standard datasets with various training data sizes: WMT14 EN-DE (4.5M pairs), WMT16 EN-RO (610K pairs), WMT17 EN-ZH (20M pairs), and WMT14 EN-FR (36M pairs, en→fr only). These datasets are all encoded into subword units by BPE (Sennrich et al., 2016). We use the same preprocessed data and train/dev/test splits as prior work for fair comparisons (EN-DE: Vaswani et al., 2017; EN-RO: Lee et al., 2018; EN-ZH: Hassan et al., 2018; Wu et al., 2019; EN-FR: Gehring et al., 2017; Ott et al., 2018). |
| Dataset Splits | Yes | We use the same preprocessed data and train/dev/test splits as prior work for fair comparisons (EN-DE: Vaswani et al., 2017; EN-RO: Lee et al., 2018; EN-ZH: Hassan et al., 2018; Wu et al., 2019; EN-FR: Gehring et al., 2017; Ott et al., 2018). |
| Hardware Specification | Yes | We use 16 Tesla V100 GPUs and accelerate training by mixed precision floating point (Micikevicius et al., 2018), and implement all models with fairseq (Ott et al., 2019). ... All models are implemented in fairseq (Ott et al., 2019) and run on a single Nvidia V100 GPU. |
| Software Dependencies | Yes | We implement all models with fairseq (Ott et al., 2019). |
| Experiment Setup | Yes | Hyperparameters: We generally follow the hyperparameters for a transformer base (Vaswani et al., 2017; Ghazvininejad et al., 2019): 6 layers for both the encoder and decoder, 8 attention heads, 512 model dimensions, and 2048 hidden dimensions. We sample weights from N(0, 0.02), initialize biases to zero, and set layer normalization parameters to β = 0, γ = 1 (Devlin et al., 2019). For regularization, we tune the dropout rate from [0.1, 0.2, 0.3] based on dev performance in each direction, and apply weight decay with 0.01 and label smoothing with ε = 0.1. We train batches of approximately 128K tokens using Adam (Kingma & Ba, 2015) with β = (0.9, 0.999) and ε = 10⁻⁶. The learning rate warms up to 5 × 10⁻⁴ in the first 10K steps, and then decays with the inverse square-root schedule. We train all models for 300K steps apart from en→fr, where we train for 500K steps to account for the data size. (The warmup/decay schedule is sketched below the table.) |
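
The Pseudocode row above quotes Algorithm 1 (parallel easy-first inference with a length beam). The sketch below illustrates only the control flow described in the paper: every target position is first predicted from the source alone, then repeatedly re-predicted conditioned on the positions that were predicted with higher confidence in the previous pass, while a small beam of candidate lengths is rescored at the end. The `score_fn` interface, the placeholder token ids, and the convergence test are assumptions made here for illustration; this is not the authors' implementation.

```python
import numpy as np

def parallel_easy_first_decode(score_fn, src, length_candidates, max_iters=10):
    """Sketch of parallel easy-first inference with a length beam.

    Assumed interface (not from the paper's code): score_fn(src, tokens, visible)
    returns (log_probs, predictions) for every target position, where position i
    may only condition on the positions marked True in visible[i].
    """
    candidates = []
    for tgt_len in length_candidates:
        # First pass: every position conditions on the source only.
        visible = np.zeros((tgt_len, tgt_len), dtype=bool)
        tokens = np.full(tgt_len, -1, dtype=np.int64)  # placeholder ids, never read
        log_probs, tokens = score_fn(src, tokens, visible)

        for _ in range(max_iters - 1):
            # Easy-first ordering: positions predicted with higher confidence
            # in the previous pass are treated as "easier".
            order = np.argsort(-log_probs)            # most confident first
            rank = np.empty(tgt_len, dtype=np.int64)
            rank[order] = np.arange(tgt_len)
            # Position i now conditions on every position easier than itself.
            visible = rank[None, :] < rank[:, None]
            new_log_probs, new_tokens = score_fn(src, tokens, visible)
            if np.array_equal(new_tokens, tokens):    # predictions converged
                break
            tokens, log_probs = new_tokens, new_log_probs

        candidates.append((log_probs.mean(), tokens))

    # Length beam: keep the hypothesis with the highest average log-probability.
    return max(candidates, key=lambda c: c[0])[1]
```

In the actual DisCo model the per-position visibility is realized through disentangled attention contexts inside a single forward pass; the explicit boolean mask here is only meant to make the easy-first ordering concrete.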
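
The Experiment Setup row describes a warmup to 5 × 10⁻⁴ over the first 10K updates followed by inverse square-root decay. The helper below is a minimal sketch of that schedule; the linear-from-zero warmup shape and the function name are assumptions mirroring fairseq's common `inverse_sqrt` scheduler rather than the authors' exact code.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=10_000):
    """Warm up to peak_lr, then decay proportionally to 1/sqrt(step).

    Assumption: warmup is linear from zero (as in fairseq's `inverse_sqrt`
    scheduler); the paper only states the peak value and the decay rule.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear warmup
    return peak_lr * (warmup_steps / step) ** 0.5  # inverse square-root decay
```

For example, this sketch returns 5 × 10⁻⁴ at step 10,000 and 2.5 × 10⁻⁴ at step 40,000, since the post-warmup rate falls off as 1/√step.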