Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement

Authors: Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Longyue Wang, Shuming Shi, Tong Zhang

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement our algorithm on top of the state-of-the-art neural machine translation model TRANSFORMER and conduct experiments on the widely-used WMT14 English⇒German and WMT17 Chinese⇒English translation datasets. Experimental results across language pairs show that the proposed approach consistently outperforms the strong baseline model and a representative static aggregation model.
Researcher Affiliation | Collaboration | Zi-Yi Dou, Carnegie Mellon University, zdou@andrew.cmu.edu; Zhaopeng Tu*, Tencent AI Lab, zptu@tencent.com; Xing Wang, Tencent AI Lab, brightxwang@tencent.com; Longyue Wang, Tencent AI Lab, vinnylywang@tencent.com; Shuming Shi, Tencent AI Lab, shumingshi@tencent.com; Tong Zhang, Tencent AI Lab, bradymzhang@tencent.com
Pseudocode | Yes | Algorithm 1: Iterative Dynamic Routing ... Algorithm 2: Iterative EM Routing (a hedged sketch of the dynamic-routing procedure follows the table)
Open Source Code | No | The paper does not contain an explicit statement about releasing open-source code for the described methodology or a link to a code repository.
Open Datasets | Yes | We conducted experiments on two widely-used WMT14 English⇒German (En⇒De) and WMT17 Chinese⇒English (Zh⇒En) translation tasks and compared our model with results reported by previous work (Gehring et al. 2017; Vaswani et al. 2017; Hassan et al. 2018).
Dataset Splits | Yes | For the En⇒De task, the training corpus consists of about 4.56 million sentence pairs. We used newstest2013 as the development set and newstest2014 as the test set. For the Zh⇒En task, we used all of the available parallel data, consisting of about 20 million sentence pairs. We used newsdev2017 as the development set and newstest2017 as the test set.
Hardware Specification | Yes | All the models were trained on eight NVIDIA P40 GPUs, where each was allocated a batch size of 4096 tokens.
Software Dependencies | No | The paper mentions using 'byte-pair encoding' and building on the Transformer model, but it does not specify any software dependencies with version numbers (e.g., a PyTorch or TensorFlow version).
Experiment Setup | Yes | We followed the configurations in (Vaswani et al. 2017), and reproduced their reported results on the En⇒De task. ... All the models were trained on eight NVIDIA P40 GPUs, where each was allocated a batch size of 4096 tokens. ... The number of output capsules N is a key parameter for our model... Another key parameter is the number of routing iterations T... (a hedged configuration sketch follows the table)
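
The table notes that the paper provides pseudocode for two routing schemes (iterative dynamic routing and iterative EM routing). As a reference point only, and not the authors' code, the following is a minimal NumPy sketch of iterative dynamic routing in the style of Sabour et al. (2017) applied to layer aggregation: the outputs of the L stacked layers act as input capsules and are routed into N output capsules over T iterations. The shapes, the per-pair transformation matrices W, and the squash non-linearity are assumptions; the paper's Algorithm 1 may differ in detail.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Non-linear squashing: keeps the direction, maps the norm into [0, 1)."""
    norm_sq = np.sum(s ** 2, axis=axis, keepdims=True)
    norm = np.sqrt(norm_sq + eps)
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)

def dynamic_routing(layer_outputs, n_out_capsules=4, n_iters=3, seed=0):
    """
    Iterative dynamic routing over the outputs of L stacked layers.

    layer_outputs: array of shape (L, d), one vector per layer
                   (a full sequence would add a length dimension).
    Returns the N output capsules, shape (n_out_capsules, d).
    """
    L, d = layer_outputs.shape
    rng = np.random.default_rng(seed)

    # Per-(input, output) transformation matrices W[i, j]: d -> d.
    W = rng.normal(scale=0.01, size=(L, n_out_capsules, d, d))
    # "Vote" of input capsule i for output capsule j.
    u_hat = np.einsum('id,ijde->ije', layer_outputs, W)      # (L, N, d)

    # Routing logits start uniform and are refined by agreement.
    b = np.zeros((L, n_out_capsules))
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over outputs
        s = np.einsum('ij,ije->je', c, u_hat)                 # weighted sum of votes
        v = squash(s)                                         # (N, d)
        b = b + np.einsum('ije,je->ij', u_hat, v)             # agreement update
    return v

# Toy usage: aggregate 6 layer outputs of dimension 8 into 4 capsules.
layers = np.random.default_rng(1).normal(size=(6, 8))
capsules = dynamic_routing(layers, n_out_capsules=4, n_iters=3)
print(capsules.shape)  # (4, 8)
```

In this reading, the resulting output capsules would be combined (e.g., concatenated and projected) into the final aggregated representation; EM routing replaces the softmax-by-agreement update with an expectation-maximization step, which is not sketched here.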
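
For quick reference, the hedged sketch below restates the quoted training setup as a Python configuration. Only the items quoted in the table (the Vaswani et al. (2017) Transformer configuration, eight NVIDIA P40 GPUs, and 4096 tokens per GPU) come from the paper; N and T are left unset because the table only identifies them as key hyperparameters, and the effective-batch computation is an inferred arithmetic convenience rather than a figure the paper states.

```python
# Hedged summary of the quoted training setup; unset values are placeholders,
# not the authors' numbers.
config = {
    "base_model": "Transformer (Vaswani et al. 2017 configuration)",
    "num_gpus": 8,              # NVIDIA P40
    "tokens_per_gpu": 4096,     # batch size per GPU, in tokens
    "n_output_capsules": None,  # N: key hyperparameter, tuned in the paper
    "routing_iterations": None, # T: key hyperparameter, tuned in the paper
}

# Inferred tokens processed per optimization step across all GPUs.
effective_batch_tokens = config["num_gpus"] * config["tokens_per_gpu"]
print(effective_batch_tokens)  # 32768 tokens per step
```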