Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation

Authors: Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, Noah Smith

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments show that given a sufficiently deep encoder, a single-layer autoregressive decoder can substantially outperform strong non-autoregressive models with comparable inference speed.
Researcher Affiliation | Collaboration | Paul G. Allen School of Computer Science & Engineering, University of Washington; Facebook AI; Allen Institute for AI
Pseudocode | No | The paper does not contain any sections explicitly labeled "Pseudocode" or "Algorithm", nor does it present structured, code-like blocks describing a method.
Open Source Code | Yes | Our code is available at https://github.com/jungokasai/deep-shallow.
Open Datasets | Yes | We experiment with 7 translation directions from four datasets of various training data sizes: WMT14 EN-DE (4.5M pairs, Bojar et al., 2014), WMT16 EN-RO (610K, Bojar et al., 2016), WMT17 EN-ZH (20M, Bojar et al., 2017), and WMT14 EN-FR (36M, EN→FR only).
Dataset Splits | Yes | We follow the preprocessing and data splits of previous work (EN-DE: Vaswani et al., 2017; EN-RO: Lee et al., 2018; EN-ZH: Hassan et al., 2018; Wu et al., 2019; EN-FR: Gehring et al., 2017). Dev. BLEU is measured after each epoch, and we average the 5 best checkpoints to obtain the final model (Vaswani et al., 2017). (See the checkpoint-averaging sketch below.)
Hardware Specification | Yes | S1 and Smax wall-clock time speedups (§2) are evaluated on the same single Nvidia V100 GPU with 16GB memory. All of our models are implemented in fairseq (Ott et al., 2019) and trained with 16 Tesla V100 GPUs, CUDA 10.1, and cuDNN 7.6.3. (See the latency-timing sketch below.)
Software Dependencies | Yes | All of our models are implemented in fairseq (Ott et al., 2019) and trained with 16 Tesla V100 GPUs, CUDA 10.1, and cuDNN 7.6.3.
Experiment Setup | Yes | Hyperparameters: We follow the hyperparameters of the base-sized transformer (Vaswani et al., 2017): 8 attention heads, 512 model dimensions, and 2,048 hidden dimensions for both the encoder and decoder. For each model and dataset, the dropout rate is tuned over [0.1, 0.2, 0.3] based on development BLEU performance. The EN→FR models are trained for 500K updates, while the others for 300K (Kasai et al., 2020). Dev. BLEU is measured after each epoch, and we average the 5 best checkpoints to obtain the final model (Vaswani et al., 2017). See the appendix for further details. Table 6 provides detailed hyperparameter values. (See the training-configuration sketch below.)
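
The final model is obtained by averaging the parameters of the 5 best checkpoints by dev BLEU; fairseq ships scripts/average_checkpoints.py for exactly this. Below is a minimal Python sketch of the same element-wise parameter average, assuming fairseq-style checkpoints that store parameters under a "model" key; the file names are hypothetical.

```python
# Minimal checkpoint-averaging sketch, assuming fairseq-style checkpoints
# (parameters stored under the "model" key). fairseq's own
# scripts/average_checkpoints.py serves the same purpose.
import torch

def average_checkpoints(paths):
    """Element-wise average of model parameters across checkpoints."""
    avg, n = None, len(paths)
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    # Divide once at the end and cast back to the original dtype.
    return {k: (v / n).to(state[k].dtype) for k, v in avg.items()}

# e.g., the 5 best checkpoints by dev BLEU (hypothetical file names):
paths = [f"checkpoints/checkpoint.best_bleu_{i}.pt" for i in range(1, 6)]
torch.save({"model": average_checkpoints(paths)}, "checkpoints/averaged.pt")
```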
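
S1 and Smax measure wall-clock speedup over a baseline when decoding with batch size one and with the maximum batch size that fits in GPU memory, respectively (§2 of the paper). The sketch below shows how the S1 latency could be timed on a GPU; `model` and its `translate` method are hypothetical stand-ins for the fairseq generation loop, and the synchronization calls assume a CUDA run.

```python
# Sketch of wall-clock latency measurement in the S1 setting (one sentence
# per decoding call). `model.translate` is a hypothetical stand-in for the
# fairseq generation loop.
import time
import torch

def time_decoding(model, sentences):
    # Warm-up so one-time CUDA initialization doesn't skew the timing.
    for s in sentences[:5]:
        model.translate(s)
    torch.cuda.synchronize()      # wait for all queued GPU work to finish
    start = time.perf_counter()
    for s in sentences:           # batch size 1: one sentence per call
        model.translate(s)
    torch.cuda.synchronize()      # make sure decoding has actually completed
    return time.perf_counter() - start

# Speedup is reported relative to a baseline model, e.g.:
# s1 = time_decoding(baseline, test_set) / time_decoding(deep_shallow, test_set)
```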
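
For concreteness, here is a hedged sketch of a fairseq-train invocation matching the quoted transformer-base hyperparameters (8 attention heads, 512 model dimensions, 2,048 hidden dimensions) together with the paper's deep-encoder, shallow-decoder allocation (12 encoder layers, 1 decoder layer). The data directory, optimizer, and learning-rate settings are standard fairseq choices assumed for illustration; the authoritative values are in Table 6 of the paper and the released repository.

```python
# Hedged training-configuration sketch under the assumptions above; the
# data directory is hypothetical, and the optimizer/LR flags are common
# fairseq defaults, not values confirmed by the quoted text.
import subprocess

subprocess.run([
    "fairseq-train", "data-bin/wmt14_en_de",              # hypothetical binarized data
    "--arch", "transformer",
    "--encoder-layers", "12", "--decoder-layers", "1",    # deep encoder, shallow decoder
    "--encoder-embed-dim", "512", "--decoder-embed-dim", "512",
    "--encoder-ffn-embed-dim", "2048", "--decoder-ffn-embed-dim", "2048",
    "--encoder-attention-heads", "8", "--decoder-attention-heads", "8",
    "--dropout", "0.1",                                   # tuned over [0.1, 0.2, 0.3] on dev BLEU
    "--max-update", "300000",                             # 500K for EN->FR, 300K otherwise
    "--optimizer", "adam",
    "--lr", "0.0005", "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "4000",
    "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.1",
], check=True)
```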