Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation
Authors: Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, Noah Smith
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show that given a sufficiently deep encoder, a single-layer autoregressive decoder can substantially outperform strong non-autoregressive models with comparable inference speed. |
| Researcher Affiliation | Collaboration | Paul G. Allen School of Computer Science & Engineering, University of Washington; Facebook AI; Allen Institute for AI |
| Pseudocode | No | The paper does not contain any sections explicitly labeled "Pseudocode" or "Algorithm", nor does it present structured, code-like blocks describing a method. |
| Open Source Code | Yes | Our code is available at https://github.com/jungokasai/deep-shallow. |
| Open Datasets | Yes | We experiment with 7 translation directions from four datasets of various training data sizes: WMT14 EN-DE (4.5M pairs, Bojar et al., 2014), WMT16 EN-RO (610K, Bojar et al., 2016), WMT17 EN-ZH (20M, Bojar et al., 2017), and WMT14 EN-FR (36M, EN→FR only). |
| Dataset Splits | Yes | We follow the preprocessing and data splits of previous work (EN-DE: Vaswani et al., 2017; EN-RO: Lee et al., 2018; EN-ZH: Hassan et al., 2018; Wu et al., 2019; EN-FR: Gehring et al., 2017). Dev. BLEU is measured after each epoch, and we average the 5 best checkpoints to obtain the final model (Vaswani et al., 2017). |
| Hardware Specification | Yes | S1 and Smax wall-clock time speedups (§2) are evaluated on the same single Nvidia V100 GPU with 16GB memory. All of our models are implemented in fairseq (Ott et al., 2019) and trained with 16 Tesla V100 GPUs, CUDA 10.1, and cuDNN 7.6.3. |
| Software Dependencies | Yes | All of our models are implemented in fairseq (Ott et al., 2019) and trained with 16 Tesla V100 GPUs, CUDA 10.1, and cuDNN 7.6.3. |
| Experiment Setup | Yes | Hyperparameters: We follow the hyperparameters of the base-sized transformer (Vaswani et al., 2017): 8 attention heads, 512 model dimensions, and 2,048 hidden dimensions for both the encoder and decoder. For each model and dataset, the dropout rate is tuned from {0.1, 0.2, 0.3} based on development BLEU performance. The EN→FR models are trained for 500K updates, while the others are trained for 300K (Kasai et al., 2020). Dev. BLEU is measured after each epoch, and we average the 5 best checkpoints to obtain the final model (Vaswani et al., 2017). See the appendix for further details. Table 6 provides detailed hyperparameter values. |
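
The Experiment Setup row quotes the base-sized transformer hyperparameters (8 heads, 512 model dimensions, 2,048 hidden dimensions), and the paper's headline configuration pairs a deep 12-layer encoder with a single-layer autoregressive decoder. The sketch below instantiates that configuration with `torch.nn.Transformer` rather than the authors' fairseq implementation, so the module choice, `batch_first` setting, and dropout value of 0.1 (one of the tuned candidates) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' fairseq code): a 12-layer encoder /
# 1-layer decoder transformer with the base-sized hyperparameters quoted
# in the Experiment Setup row.
model = nn.Transformer(
    d_model=512,            # model dimensions
    nhead=8,                # attention heads
    num_encoder_layers=12,  # deep encoder (the paper's 12-1 configuration)
    num_decoder_layers=1,   # shallow, single-layer autoregressive decoder
    dim_feedforward=2048,   # hidden (feed-forward) dimensions
    dropout=0.1,            # tuned from {0.1, 0.2, 0.3} on dev BLEU
    batch_first=True,
)

# Shape check with dummy source/target embeddings (batch=2, src_len=7, tgt_len=5).
src = torch.randn(2, 7, 512)
tgt = torch.randn(2, 5, 512)
out = model(src, tgt)
print(out.shape)  # torch.Size([2, 5, 512])
```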
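
The Dataset Splits and Experiment Setup rows both note that the final model is obtained by averaging the 5 best checkpoints. Below is a minimal element-wise averaging sketch in plain PyTorch; fairseq ships its own `scripts/average_checkpoints.py` utility, and the checkpoint layout assumed here (a top-level `"model"` state dict) and the file names in the usage comment are hypothetical.

```python
import torch

def average_checkpoints(paths):
    """Element-wise mean of parameter tensors across several checkpoints.

    Sketch of the "average the 5 best checkpoints" step; assumes each file
    stores its parameters under a top-level "model" key, as fairseq does.
    """
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Usage with hypothetical file names:
# averaged = average_checkpoints([f"checkpoint.best_{i}.pt" for i in range(1, 6)])
# torch.save({"model": averaged}, "checkpoint.avg5.pt")
```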