Recurrent Stacking of Layers for Compact Neural Machine Translation Models

Authors: Raj Dabre, Atsushi Fujita
Pages: 6292-6299

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report on an extensive case study on neural machine translation (NMT) using our proposed method, experimenting with a variety of datasets. We empirically show that the translation quality of a model that recurrently stacks a single layer 6 times, despite its significantly fewer parameters, approaches that of a model that stacks 6 different layers. (A parameter-sharing sketch follows the table.)
Researcher Affiliation | Academia | Raj Dabre, Atsushi Fujita, National Institute of Information and Communications Technology, 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan. firstname.lastname@nict.go.jp
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | See several sentence-level self- and cross-attention visualizations in our supplementary material. (Footnote 15: https://github.com/prajdabre/RSNMT)
Open Datasets | Yes | For our Japanese-English (Ja-En) translation in both directions, we used the Asian Language Treebank (ALT) parallel corpus (Thu et al. 2016), the Global Communication Plan (GCP) corpus (Imamura and Sumita 2018), the Kyoto free translation task (KFTT) corpus, and the Asian Scientific Paper Excerpt Corpus (ASPEC) (Nakazawa et al. 2016). We also experimented with the Turkish-English (Tr-En) language pair using the WMT 2018 corpus.
Dataset Splits | Yes | Table 1: Datasets and model settings (includes a 'Dev' column with sentence counts). We used newstest2016 for development, and newstest2017 (test17) and newstest2018 (test18) for testing.
Hardware Specification | No | No specific GPU or CPU models were mentioned. The paper only states 'transformer base single gpu' for the default settings and '4 GPUs for training' without specifying the GPU model.
Software Dependencies | Yes | We implemented our method on top of an open-source implementation of the Transformer model (Vaswani et al. 2017) in the version 1.6 branch of tensor2tensor.
Experiment Setup | Yes | For training, we used the default model settings corresponding to transformer base single gpu (Vaswani et al. 2017), except the number of sub-words, training iterations, and number of GPUs. ... The details of sub-word vocabularies and training iterations are in Table 1. ... We decoded the test set sentences with a beam size of 4 and length penalty of α = 0.6 for the KFTT Japanese-to-English experiments and α = 1.0 for the rest. (A note on the length-penalty formula follows the table.)
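The "recurrently stacks a single layer 6 times" claim in the Research Type row means one set of layer parameters is reused at every depth instead of instantiating 6 distinct layers. Below is a minimal PyTorch sketch of that parameter-sharing scheme; it is not the authors' tensor2tensor implementation, and the 512-dimensional, 8-head sizes are assumed from transformer base.

```python
# Minimal sketch of recurrent layer stacking (parameter sharing).
# Assumptions: PyTorch, transformer-base-like sizes (d_model=512, 8 heads, depth 6).
# This illustrates the idea only; it is not the authors' tensor2tensor code.
import torch
import torch.nn as nn

D_MODEL, N_HEADS, N_STACKS = 512, 8, 6

# One layer's worth of parameters, reused at every depth.
shared_layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS)

def recurrently_stacked_encoder(x: torch.Tensor) -> torch.Tensor:
    """Apply the same layer N_STACKS times: one layer, recurrently stacked 6 times."""
    for _ in range(N_STACKS):
        x = shared_layer(x)
    return x

# For contrast, a vanilla 6-layer encoder deep-copies the layer 6 times,
# so it holds roughly 6x the encoder-layer parameters.
vanilla_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS), num_layers=N_STACKS
)

x = torch.randn(20, 2, D_MODEL)  # (sequence length, batch, d_model)
y = recurrently_stacked_encoder(x)
```

In such a scheme, embedding and output-projection parameters are untouched; only the per-layer weights are shared, which is why the recurrently stacked model is compact rather than 6x smaller overall.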
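On the decoding settings in the Experiment Setup row: assuming tensor2tensor's default beam search follows the length normalization of Wu et al. (2016), which the paper does not spell out, the length penalty α is applied as

```latex
s(Y \mid X) = \frac{\log P(Y \mid X)}{\mathrm{lp}(Y)},
\qquad
\mathrm{lp}(Y) = \left(\frac{5 + |Y|}{6}\right)^{\alpha}
```

so a larger α (here 1.0 vs. 0.6) normalizes scores more aggressively by length and thus penalizes longer hypotheses less.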