Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation

Authors: Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Ari, Jason Riesa, Ankur Bapna, Orhan Firat, Karthik Raman

AAAI 2020, pp. 8854-8861 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we evaluate the cross-lingual effectiveness of representations from the encoder of a massively multilingual NMT model on 5 downstream classification and sequence labeling tasks covering a diverse set of over 50 languages. We compare against a strong baseline, multilingual BERT (mBERT) (Devlin et al. 2018), in different cross-lingual transfer learning scenarios and show gains in zero-shot transfer in 4 out of these 5 tasks.
Researcher Affiliation | Industry | Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Ari, Jason Riesa, Ankur Bapna, Orhan Firat, Karthik Raman; Google Research; {adisid, melvinp, henrytsai, navari, reisa, ankurbpn, orhanf, karthikraman}@google.com
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper mentions open-source implementations and links in a footnote to the SentencePiece GitHub page (https://github.com/google/sentencepiece), but it does not state that the authors release the code for their own Massively Multilingual Translation Encoder (MMTE) or their experimental setup.
Open Datasets | Yes | We train our multilingual NMT system on a massive scale, using an in-house corpus generated by crawling and extracting parallel sentences from the web (Uszkoreit et al. 2010). This corpus contains parallel documents for 102 languages, to and from English, comprising a total of 25 billion sentence pairs.
Dataset Splits | Yes | The in-language setting has training, development, and test sets from the language. In the zero-shot setting, the train and dev sets contain only English examples but we test on all the languages. (A split-construction sketch follows the table.)
Hardware Specification | No | The paper describes the model architecture and parameters ('Transformer Big containing 375M parameters') but does not specify the hardware (e.g., CPU or GPU models, memory) used for training or inference.
Software Dependencies | No | The paper mentions the use of the 'Transformer architecture (Vaswani et al. 2017) in the open-source implementation under the Lingvo framework (Shen et al. 2019)' and a 'sentence-piece model (SPM) (Kudo and Richardson 2018)'. While these tools are named, specific version numbers for Lingvo or SentencePiece are not provided in the text or footnotes. (A SentencePiece usage sketch follows the table.)
Experiment Setup | Yes | We use a larger version of Transformer Big containing 375M parameters (6 layers, 16 heads, 8192 hidden dimension) (Chen et al. 2018), and a shared source-target sentence-piece model (SPM) (Kudo and Richardson 2018) vocabulary with 64k individual tokens. All our models are trained with Adafactor (Shazeer and Stern 2018) with momentum factorization, a learning rate schedule of (3.0, 40k) and a per-parameter norm clipping threshold of 1.0. (A configuration sketch follows the table.)
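The in-language versus zero-shot protocol described in the Dataset Splits row can be summarized in a short sketch. This is a minimal illustration, not the authors' code: `load_split`, the task name, and the language codes are hypothetical placeholders.

```python
# Illustrative sketch of the two transfer settings from the Dataset Splits row.
# `load_split` is a hypothetical stand-in for task-specific data loading.

def load_split(task, language, split):
    """Hypothetical loader; replace with real dataset loading."""
    return f"{task}-{language}-{split}"  # placeholder record

def build_splits(task, languages, setting):
    """Return {language: (train, dev, test)} for the given transfer setting."""
    if setting == "in-language":
        # Train, dev, and test all come from the target language.
        return {lang: (load_split(task, lang, "train"),
                       load_split(task, lang, "dev"),
                       load_split(task, lang, "test"))
                for lang in languages}
    if setting == "zero-shot":
        # Train and dev contain only English; evaluation covers every language.
        train = load_split(task, "en", "train")
        dev = load_split(task, "en", "dev")
        return {lang: (train, dev, load_split(task, lang, "test"))
                for lang in languages}
    raise ValueError(f"unknown setting: {setting}")

# Example: zero-shot evaluation over a few languages.
print(build_splits("xnli", ["en", "fr", "hi"], "zero-shot"))
```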
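The Software Dependencies row names SentencePiece without a version. As a hedged illustration, a shared 64k source-target vocabulary like the one described in the Experiment Setup row could be trained with the public sentencepiece Python package roughly as follows, assuming a recent release that supports the keyword-argument API; the corpus path and model prefix are placeholders, not values from the paper.

```python
import sentencepiece as spm

# Train a shared source-target subword model with a 64k vocabulary, matching
# the vocabulary size reported in the paper. The input path and model prefix
# are placeholders, not from the paper.
spm.SentencePieceTrainer.train(
    input="parallel_corpus.txt",   # placeholder: concatenated source+target text
    model_prefix="mmte_spm_64k",   # placeholder output name
    vocab_size=64000,
    model_type="unigram",          # SentencePiece default
)

# Load the trained model and segment a sample sentence.
sp = spm.SentencePieceProcessor(model_file="mmte_spm_64k.model")
print(sp.encode("Hello world", out_type=str))
```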
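The Experiment Setup row lists the model and optimizer hyperparameters. Below is a minimal sketch that collects them into a plain configuration dictionary; the field names are illustrative, and this is not the authors' actual Lingvo configuration.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered into a plain
# config dict. Field names are illustrative; this is not the Lingvo setup.
MMTE_TRAINING_CONFIG = {
    "architecture": "Transformer Big (scaled)",   # Chen et al. 2018 variant
    "num_parameters": 375_000_000,
    "num_layers": 6,
    "num_attention_heads": 16,
    "hidden_dim": 8192,                           # feed-forward hidden dimension
    "vocab": {
        "type": "sentencepiece",
        "size": 64_000,
        "shared_source_target": True,
    },
    "optimizer": {
        "name": "Adafactor",
        "momentum_factorization": True,
        "learning_rate_schedule": (3.0, 40_000),  # (3.0, 40k) as quoted in the paper
        "per_parameter_norm_clip": 1.0,
    },
}
```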