Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beyond English-Centric Multilingual Machine Translation

Authors: Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, Armand Joulin

JMLR 2021 | Venue PDF | LLM Run Details

Each entry below gives a reproducibility variable, its classified result, and the supporting LLM response.
Research Type: Experimental. LLM response: "Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems from the Workshop on Machine Translation (WMT). [...] Finally, we end with a thorough analysis, including human evaluation, of the quality of our 100x100 Many-to-Many translation system (Section 6)."
Researcher Affiliation: Collaboration. LLM response: "Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin. Facebook AI, LORIA. Corresponding Author. Email: EMAIL."
Pseudocode: No. LLM response: "The paper describes methods through narrative text, equations, and diagrams (e.g., Figure 1 for model architecture and data mining strategy), but does not include any explicitly labeled pseudocode or algorithm blocks."
Open Source Code: Yes. LLM response: "We open-source our scripts so that others may reproduce the data, evaluation, and final m2m100 model: https://github.com/pytorch/fairseq/tree/master/examples/m2m_100."
Open Datasets: Yes. LLM response: "FLORES (Guzmán et al., 2019) pairs two low resource languages, Sinhala and Nepali, with English in the Wikipedia domain. This can be found here: https://github.com/facebookresearch/flores. [...] We leverage and extend the corpus provided by two of these mining projects: CCMatrix (Schwenk et al., 2019b) and CCAligned (El-Kishky et al., 2020). CCMatrix can be downloaded here: https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix. CCAligned can be downloaded here: http://www.statmt.org/cc-aligned."
Dataset Splits: Yes. LLM response: "Autshumato is an 11-way parallel data set comprising 10 African languages and English from the government domain. There is no standard valid/test split, so we use the first half of the data set for valid and second half for test."
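The ad hoc split described in that quote (no standard split, so first half becomes validation and second half becomes test) can be sketched as below. The function name and the list-of-pairs representation are illustrative assumptions, not the authors' actual preprocessing code.

```python
# Sketch of the Autshumato-style valid/test split: with no standard
# split available, the first half of the aligned data is used for
# validation and the second half for test.

def split_valid_test(sentence_pairs):
    """Return (valid, test): first half of the pairs, then the rest."""
    mid = len(sentence_pairs) // 2
    return sentence_pairs[:mid], sentence_pairs[mid:]

# Hypothetical aligned (source, target) sentence pairs.
pairs = [("src %d" % i, "tgt %d" % i) for i in range(10)]
valid, test = split_valid_test(pairs)
```

Note that for an odd-length corpus this puts the extra sentence pair in the test half; the paper does not specify how a midpoint tie is handled.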
Hardware Specification: No. LLM response: "The training time for the 418M parameter model was approximately 8 days on 128 GPUs. The training time for the 1.2B parameter model was approximately 11 days on 128 GPUs. [...] The 12B model trained for 33 days on 512 GPUs."
Software Dependencies: Yes. LLM response: "For most languages, we use the moses tokenizer (Koehn et al., 2007). For Chinese we use the SacreBLEU tokenizer (tok zh) (Post, 2018) and convert all traditional characters generated by the model to simplified characters using HanziConv, for Indian languages we use the Indic NLP library (Kunchukuttan, 2020), for Japanese we use Kytea, for Thai we use PyThaiNLP (Phatthiyaphaibun et al., 2016), for Arabic we use the QCRI Arabic Normalizer, for Korean we use Mecab, for Burmese we use the official segmentation tool provided by Ding et al. (2019)."
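The per-language tokenizer choices in that quote amount to a dispatch table with Moses as the default. The sketch below records them as plain labels only; the language codes and function name are assumptions, and no actual tokenizer libraries are invoked.

```python
# Illustrative dispatch table for the per-language preprocessing tools
# listed in the quote. Values are descriptive labels, not library calls.
LANG_TOKENIZER = {
    "zh": "SacreBLEU tok zh + HanziConv (traditional -> simplified)",
    "hi": "Indic NLP library",          # and other Indian languages
    "ja": "Kytea",
    "th": "PyThaiNLP",
    "ar": "QCRI Arabic Normalizer",
    "ko": "Mecab",
    "my": "official Burmese segmentation tool (Ding et al., 2019)",
}

def tokenizer_for(lang):
    # Default for most languages: the Moses tokenizer.
    return LANG_TOKENIZER.get(lang, "Moses tokenizer")
```

For example, `tokenizer_for("en")` falls through to the Moses default, while `tokenizer_for("ja")` selects Kytea.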
Experiment Setup: Yes. LLM response: "Our starting point for improving massively multilingual translation models is a large Transformer model, with 12 Encoder and 12 Decoder layers, with 8192 hidden units in the FFN and 1024 embedding dimension. We share the weight matrices of the input and output embeddings. The total parameter count is 1.2B. We train with the Adam optimizer (Kingma and Ba, 2015) and warmup first for 4000 updates (setting the batch size to 4000 tokens), with label smoothing 0.1 (Szegedy et al., 2015; Pereyra et al., 2017). For regularization, we tune the dropout parameter between {0.1, 0.2, 0.3}."
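The architecture numbers in that quote can be captured in a small config sketch. The vocabulary size below is an assumption (the quote does not state it), and the parameter estimate deliberately ignores biases, layer norms, and positional embeddings, so it only serves as an order-of-magnitude sanity check, not a reproduction of the reported 1.2B total.

```python
# Rough sketch of the reported configuration: 12 encoder / 12 decoder
# layers, FFN dim 8192, embedding dim 1024, shared input/output
# embeddings, dropout tuned over {0.1, 0.2, 0.3}.
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    encoder_layers: int = 12
    decoder_layers: int = 12
    embed_dim: int = 1024
    ffn_dim: int = 8192
    vocab_size: int = 128_000   # assumed; not stated in the quote
    dropout: float = 0.1        # one of the tuned values {0.1, 0.2, 0.3}

def approx_params(cfg):
    """Crude weight count: shared embedding matrix plus per-layer
    attention and FFN projections, ignoring biases and norms."""
    embed = cfg.vocab_size * cfg.embed_dim        # shared in/out embedding
    attn = 4 * cfg.embed_dim * cfg.embed_dim      # q, k, v, out projections
    ffn = 2 * cfg.embed_dim * cfg.ffn_dim         # up- and down-projection
    per_enc = attn + ffn
    per_dec = 2 * attn + ffn                      # self- plus cross-attention
    return (embed
            + cfg.encoder_layers * per_enc
            + cfg.decoder_layers * per_dec)

cfg = TransformerConfig()
```

Under these simplifications the estimate lands in the high hundreds of millions; the gap to the reported 1.2B would come from the terms omitted here and from any architecture details not included in the quote.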