Recurrent Stacking of Layers for Compact Neural Machine Translation Models
Authors: Raj Dabre, Atsushi Fujita (pp. 6292-6299)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report on an extensive case study on neural machine translation (NMT) using our proposed method, experimenting with a variety of datasets. We empirically show that the translation quality of a model that recurrently stacks a single layer 6 times, despite its significantly fewer parameters, approaches that of a model that stacks 6 different layers. |
| Researcher Affiliation | Academia | Raj Dabre, Atsushi Fujita National Institute of Information and Communications Technology 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan firstname.lastname@nict.go.jp |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | See several sentence-level self- and cross-attention visualizations in our supplementary material. (Footnote 15: https://github.com/prajdabre/RSNMT) |
| Open Datasets | Yes | For our Japanese-English (Ja-En) translation for both directions, we used the Asian Language Treebank (ALT) parallel corpus (Thu et al. 2016), the Global Communication Plan (GCP) corpus (Imamura and Sumita 2018), the Kyoto free translation task (KFTT) corpus, and the Asian Scientific Paper Excerpt Corpus (ASPEC) (Nakazawa et al. 2016). We also experimented with the Turkish-English (Tr-En) language pair using the WMT 2018 corpus. |
| Dataset Splits | Yes | Table 1: Datasets and model settings. (Includes 'Dev' column with sentence counts). We used newstest2016 for development, and newstest2017 (test17) and newstest2018 (test18) for testing. |
| Hardware Specification | No | No specific GPU or CPU models were mentioned. The paper only states 'transformer base single gpu' for default settings and '4 GPUs for training' without specifying the model of the GPU. |
| Software Dependencies | Yes | We implemented our method on top of an open-source implementation of the Transformer model (Vaswani et al. 2017) in the version 1.6 branch of tensor2tensor. |
| Experiment Setup | Yes | For training, we used the default model settings corresponding to transformer base single gpu (Vaswani et al. 2017), except the number of sub-words, training iterations, and number of GPUs. ... The details of sub-word vocabularies and training iterations are in Table 1. ... We decoded the test set sentences with a beam size of 4 and length penalty of α = 0.6 for the KFTT Japanese-to-English experiments and α = 1.0 for the rest. |
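
The "Research Type" row above quotes the paper's core idea: reusing the parameters of a single Transformer layer at every depth step instead of learning six distinct layers. Below is a minimal PyTorch sketch of that contrast, written for illustration only; it is not the authors' tensor2tensor implementation (linked in the "Open Source Code" row), and class names such as `RecurrentlyStackedEncoder` are hypothetical.

```python
# Minimal sketch (assumption: PyTorch, not the authors' tensor2tensor code).
# It contrasts a 6-layer Transformer encoder with recurrently applying one
# shared layer 6 times, which is the parameter-sharing idea quoted above.
import torch
import torch.nn as nn


class VanillaEncoder(nn.Module):
    """Baseline: six independently parameterized encoder layers."""

    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048)
            for _ in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class RecurrentlyStackedEncoder(nn.Module):
    """One encoder layer whose weights are reused at every depth step."""

    def __init__(self, d_model=512, nhead=8, num_steps=6):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=2048
        )
        self.num_steps = num_steps

    def forward(self, x):
        for _ in range(self.num_steps):
            x = self.shared_layer(x)  # same parameters applied recurrently
        return x


def count_params(module):
    return sum(p.numel() for p in module.parameters())


if __name__ == "__main__":
    x = torch.randn(10, 2, 512)  # (seq_len, batch, d_model)
    vanilla, shared = VanillaEncoder(), RecurrentlyStackedEncoder()
    print(count_params(vanilla), count_params(shared))  # roughly a 6:1 ratio
    print(shared(x).shape)  # output shape matches the input: (10, 2, 512)
```

Because the shared layer is applied six times but stored only once, the recurrently stacked encoder holds roughly one sixth of the layer parameters of the baseline, which is the parameter saving the quoted abstract refers to.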
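Similarly, the decoding settings in the "Experiment Setup" row (beam size 4, length penalty α = 0.6 or 1.0) correspond in tensor2tensor to the `beam_size` and `alpha` decode hyperparameters, where α controls a GNMT-style length penalty. The sketch below re-implements that penalty to show how α changes which hypothesis a beam search prefers; it is a hedged, standalone illustration, not the library's code path.

```python
# Hedged re-implementation (assumption) of the GNMT-style length penalty that
# tensor2tensor's beam search applies when ranking hypotheses; alpha = 0.6
# (KFTT Japanese-to-English) and alpha = 1.0 (all other settings) are the
# values quoted in the "Experiment Setup" row above.

def length_penalty(length: int, alpha: float) -> float:
    """GNMT length penalty: ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob: float, length: int, alpha: float) -> float:
    """Score used to rank a beam hypothesis with total log-probability log_prob."""
    return log_prob / length_penalty(length, alpha)


if __name__ == "__main__":
    short = (-5.0, 10)  # (total log-probability, length in tokens)
    long_ = (-8.0, 20)
    for alpha in (0.6, 1.0):
        print(
            f"alpha={alpha}: "
            f"short={normalized_score(*short, alpha):.3f} "
            f"long={normalized_score(*long_, alpha):.3f}"
        )
    # alpha = 0.6 prefers the shorter hypothesis; alpha = 1.0 normalizes more
    # strongly by length, so the longer hypothesis overtakes it.
```

In short, a larger α rewards longer hypotheses more, which is why the paper reports a smaller α only for the KFTT Japanese-to-English setting.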