Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation

Authors: Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, Noah Smith

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments show that given a sufficiently deep encoder, a single-layer autoregressive decoder can substantially outperform strong non-autoregressive models with comparable inference speed.
Researcher Affiliation | Collaboration | Paul G. Allen School of Computer Science & Engineering, University of Washington; Facebook AI; Allen Institute for AI
Pseudocode | No | The paper does not contain any sections explicitly labeled "Pseudocode" or "Algorithm", nor does it present structured, code-like blocks describing a method.
Open Source Code | Yes | Our code is available at https://github.com/jungokasai/deep-shallow.
Open Datasets | Yes | We experiment with 7 translation directions from four datasets of various training data sizes: WMT14 EN-DE (4.5M pairs, Bojar et al., 2014), WMT16 EN-RO (610K, Bojar et al., 2016), WMT17 EN-ZH (20M, Bojar et al., 2017), and WMT14 EN-FR (36M, EN→FR only).
Dataset Splits | Yes | We follow the preprocessing and data splits of previous work (EN-DE: Vaswani et al., 2017; EN-RO: Lee et al., 2018; EN-ZH: Hassan et al., 2018; Wu et al., 2019; EN-FR: Gehring et al., 2017). Dev. BLEU is measured after each epoch, and we average the 5 best checkpoints to obtain the final model (Vaswani et al., 2017). (See the checkpoint-averaging sketch below.)
Hardware Specification | Yes | S1 and Smax wall-clock time speedups (§2) are evaluated on the same single Nvidia V100 GPU with 16GB memory. All of our models are implemented in fairseq (Ott et al., 2019) and trained with 16 Tesla V100 GPUs, CUDA 10.1, and cuDNN 7.6.3. (See the latency-timing sketch below.)
Software Dependencies | Yes | All of our models are implemented in fairseq (Ott et al., 2019) and trained with 16 Tesla V100 GPUs, CUDA 10.1, and cuDNN 7.6.3.
Experiment Setup | Yes | Hyperparameters: We follow the hyperparameters of the base-sized transformer (Vaswani et al., 2017): 8 attention heads, 512 model dimensions, and 2,048 hidden dimensions for both the encoder and decoder. For each model and dataset, the dropout rate is tuned over [0.1, 0.2, 0.3] based on development BLEU performance. The EN→FR models are trained for 500K updates, while the others for 300K (Kasai et al., 2020). Dev. BLEU is measured after each epoch, and we average the 5 best checkpoints to obtain the final model (Vaswani et al., 2017). See the appendix for further details. Table 6 provides detailed hyperparameter values. (See the training-configuration sketch below.)
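
The final model is obtained by averaging the parameters of the 5 best checkpoints by dev BLEU; fairseq ships scripts/average_checkpoints.py for exactly this. Below is a minimal Python sketch of the same element-wise parameter average, assuming fairseq-style checkpoints that store parameters under a "model" key; the file names are hypothetical.

```python
# Minimal checkpoint-averaging sketch, assuming fairseq-style checkpoints
# (parameters stored under the "model" key). fairseq's own
# scripts/average_checkpoints.py serves the same purpose.
import torch

def average_checkpoints(paths):
    """Element-wise average of model parameters across checkpoints."""
    avg, n = None, len(paths)
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    # Divide once at the end and cast back to the original dtype.
    return {k: (v / n).to(state[k].dtype) for k, v in avg.items()}

# e.g., the 5 best checkpoints by dev BLEU (hypothetical file names):
paths = [f"checkpoints/checkpoint.best_bleu_{i}.pt" for i in range(1, 6)]
torch.save({"model": average_checkpoints(paths)}, "checkpoints/averaged.pt")
```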
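
S1 and Smax measure wall-clock speedup over a baseline when decoding with batch size one and with the maximum batch size that fits in GPU memory, respectively (§2 of the paper). The sketch below shows how the S1 latency could be timed on a GPU; `model` and its `translate` method are hypothetical stand-ins for the fairseq generation loop, and the synchronization calls assume a CUDA run.

```python
# Sketch of wall-clock latency measurement in the S1 setting (one sentence
# per decoding call). `model.translate` is a hypothetical stand-in for the
# fairseq generation loop.
import time
import torch

def time_decoding(model, sentences):
    # Warm-up so one-time CUDA initialization doesn't skew the timing.
    for s in sentences[:5]:
        model.translate(s)
    torch.cuda.synchronize()      # wait for all queued GPU work to finish
    start = time.perf_counter()
    for s in sentences:           # batch size 1: one sentence per call
        model.translate(s)
    torch.cuda.synchronize()      # make sure decoding has actually completed
    return time.perf_counter() - start

# Speedup is reported relative to a baseline model, e.g.:
# s1 = time_decoding(baseline, test_set) / time_decoding(deep_shallow, test_set)
```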
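
For concreteness, here is a hedged sketch of a fairseq-train invocation matching the quoted transformer-base hyperparameters (8 attention heads, 512 model dimensions, 2,048 hidden dimensions) together with the paper's deep-encoder, shallow-decoder allocation (12 encoder layers, 1 decoder layer). The data directory, optimizer, and learning-rate settings are standard fairseq choices assumed for illustration; the authoritative values are in Table 6 of the paper and the released repository.

```python
# Hedged training-configuration sketch under the assumptions above; the
# data directory is hypothetical, and the optimizer/LR flags are common
# fairseq defaults, not values confirmed by the quoted text.
import subprocess

subprocess.run([
    "fairseq-train", "data-bin/wmt14_en_de",              # hypothetical binarized data
    "--arch", "transformer",
    "--encoder-layers", "12", "--decoder-layers", "1",    # deep encoder, shallow decoder
    "--encoder-embed-dim", "512", "--decoder-embed-dim", "512",
    "--encoder-ffn-embed-dim", "2048", "--decoder-ffn-embed-dim", "2048",
    "--encoder-attention-heads", "8", "--decoder-attention-heads", "8",
    "--dropout", "0.1",                                   # tuned over [0.1, 0.2, 0.3] on dev BLEU
    "--max-update", "300000",                             # 500K for EN->FR, 300K otherwise
    "--optimizer", "adam",
    "--lr", "0.0005", "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "4000",
    "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.1",
], check=True)
```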