Multilingual Neural Machine Translation With Soft Decoupled Encoding

Authors: Xinyi Wang, Hieu Pham, Philip Arthur, Graham Neubig

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test SDE on four low-resource languages from a multilingual TED corpus (Qi et al., 2018). Our method shows consistent improvements over multilingual NMT baselines for all four languages and, importantly, outperforms previous methods for multilingual NMT that allow for more intelligent parameter sharing but do not use a two-step process of character-level representation and latent meaning representation (Gu et al., 2018). Our method outperforms the best baseline by about 2 BLEU for one of the low-resource languages, achieving new state-of-the-art results on all four language pairs compared to strong multilingually trained and adapted baselines (Neubig & Hu, 2018). (A sketch of this two-step embedding follows the table.)
Researcher Affiliation | Collaboration | Xinyi Wang (1), Hieu Pham (1, 2), Philip Arthur (3), and Graham Neubig (1); (1) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (2) Google Brain, Mountain View, CA 94043, USA; (3) Monash University, Clayton VIC 3800, Australia
Pseudocode | No | The paper describes the model architecture and its components but does not provide any pseudocode or algorithm blocks.
Open Source Code | Yes | The source code is available at https://github.com/cindyxinyiwang/SDE
Open Datasets | Yes | We use the 58-language-to-English TED corpus for experiments. Following the settings of prior works on multilingual NMT (Neubig & Hu, 2018; Qi et al., 2018), we use three low-resource language datasets: Azerbaijani (aze), Belarusian (bel), Galician (glg) to English, and a slightly higher-resource dataset, namely Slovak (slk) to English.
Dataset Splits | Yes | LRL train/dev/test with paired HRL train: aze 5.94k/671/903 (tur 182k); bel 4.51k/248/664 (rus 208k); glg 10.0k/682/1007 (por 185k); slk 61.5k/2271/2445 (ces 103k).
Hardware Specification | No | The acknowledgements thank Amazon for providing GPU credits, but the paper does not give specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies | No | The paper describes model components (e.g., LSTM, Adam optimizer) but does not provide specific version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | We use a 1-layer long-short-term-memory (LSTM) network with a hidden dimension of 512 for both the encoder and the decoder. The word embedding dimension is kept at 128, and all other layer dimensions are set to 512. We use a dropout rate of 0.3 for the word embedding and the output vector before the decoder softmax layer. The batch size is set to be 1500 words. We evaluate by development set BLEU score for every 2500 training batches. For training, we use the Adam optimizer with a learning rate of 0.001. We use learning rate decay of 0.8, and stop training if the model performance on the development set doesn't improve for 5 evaluation steps.
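
As a rough illustration of the Experiment Setup row above, here is a minimal PyTorch-style sketch of the quoted hyperparameters. It is not the authors' released code: the `Seq2SeqLSTM` class, the vocabulary sizes, and the plateau-based realization of the 0.8 learning-rate decay are assumptions, and the attention mechanism and SDE embedding layer of the actual model are omitted.

```python
# Hedged sketch of the quoted training configuration (assumed PyTorch implementation).
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB = 32000, 32000  # placeholder vocabulary sizes (not from the paper)

class Seq2SeqLSTM(nn.Module):
    def __init__(self, emb_dim=128, hid_dim=512, dropout=0.3):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, emb_dim)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, emb_dim)
        self.emb_drop = nn.Dropout(dropout)   # dropout 0.3 on word embeddings
        self.encoder = nn.LSTM(emb_dim, hid_dim, num_layers=1, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, num_layers=1, batch_first=True)
        self.out_drop = nn.Dropout(dropout)   # dropout 0.3 before the softmax layer
        self.proj = nn.Linear(hid_dim, TGT_VOCAB)

    def forward(self, src, tgt_in):
        _, state = self.encoder(self.emb_drop(self.src_emb(src)))
        dec_out, _ = self.decoder(self.emb_drop(self.tgt_emb(tgt_in)), state)
        return self.proj(self.out_drop(dec_out))  # logits over target vocabulary

model = Seq2SeqLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, lr 0.001
# Evaluation every 2500 batches of ~1500 words; decay the learning rate by 0.8
# when dev BLEU stalls and stop after 5 evaluations without improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.8, patience=0)
```

In a training loop one would call `scheduler.step(dev_bleu)` after each evaluation; how exactly the authors scheduled the decay is an assumption here.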
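
For the "two-step process of character-level representation and latent meaning representation" cited in the Research Type row, the sketch below reconstructs the general idea under stated assumptions: a word is first embedded as a bag of character n-grams (lexical step), then re-expressed by attending over a shared matrix of latent "meaning" vectors (semantic step). The class name, dimensions, and residual combination are illustrative assumptions, and the paper's language-specific transform is omitted; this is not the paper's exact formulation.

```python
# Hedged sketch of a soft decoupled word embedding (assumed shapes and naming).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDEEmbedding(nn.Module):
    def __init__(self, num_ngrams=50000, num_latent=10000, dim=128):
        super().__init__()
        self.ngram_emb = nn.Embedding(num_ngrams, dim)                 # shared character n-gram embeddings
        self.latent_emb = nn.Parameter(torch.randn(num_latent, dim))   # shared latent "meaning" vectors

    def forward(self, ngram_ids, ngram_mask):
        # ngram_ids, ngram_mask: (batch, words, max_ngrams); mask zeroes out padding n-grams.
        # Step 1: lexical embedding = bag of character n-grams of each word.
        lex = torch.tanh((self.ngram_emb(ngram_ids) * ngram_mask.unsqueeze(-1)).sum(dim=-2))
        # Step 2: latent semantic embedding = attention over the shared latent matrix.
        attn = F.softmax(lex @ self.latent_emb.t(), dim=-1)
        sem = attn @ self.latent_emb
        return sem + lex  # residual combination (assumed)
```

The resulting per-word vectors would feed the LSTM encoder in place of ordinary word embeddings.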