Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
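To make the validation step concrete: comparing an automated classifier's labels against a manually labeled gold set reduces to computing agreement statistics per variable. The function and labels below are purely illustrative, not the actual pipeline described in [1]:

```python
from collections import Counter

def validation_metrics(gold, predicted):
    """Compare automated labels against manual gold labels.

    gold, predicted: equal-length lists of class labels (e.g. "Yes"/"No").
    Returns overall accuracy and a Counter of (gold, predicted) pairs,
    which serves as a confusion matrix with tuple keys.
    """
    assert len(gold) == len(predicted), "label lists must align"
    agree = sum(g == p for g, p in zip(gold, predicted))
    accuracy = agree / len(gold)
    confusion = Counter(zip(gold, predicted))
    return accuracy, confusion

# Hypothetical labels for one reproducibility variable.
gold = ["No", "Yes", "No", "No", "Yes"]
pred = ["No", "Yes", "Yes", "No", "Yes"]
acc, conf = validation_metrics(gold, pred)
print(f"accuracy = {acc:.2f}")   # accuracy = 0.80
print(conf[("No", "Yes")])       # 1 (one false positive)
```

Per-variable accuracy computed this way is what lets scores like those in the table below be read as estimates with a known error rate.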
A Survey of Cross-lingual Word Embedding Models
Authors: Sebastian Ruder, Ivan Vulić, Anders Søgaard
JAIR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives... We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons. Section 10. Evaluation ... We subsequently provide an extensive discussion of the tasks, benchmarks, and challenges of the evaluation of cross-lingual embedding models in Section 10 and outline applications in Section 11. We present general challenges and future research directions in learning cross-lingual word representations in Section 12. Benchmark studies To conclude this section, we summarize the findings of three recent benchmark studies of cross-lingual embeddings: Upadhyay et al. (2016) evaluate cross-lingual embedding models that require different forms of supervision on various tasks. |
| Researcher Affiliation | Collaboration | Sebastian Ruder (EMAIL), Insight Research Centre, National University of Ireland, Galway, Ireland, and Aylien Ltd., Dublin, Ireland; Ivan Vulić (EMAIL), Language Technology Lab, University of Cambridge, UK; Anders Søgaard (EMAIL), University of Copenhagen, Copenhagen, Denmark |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are found in the paper. The methodologies are described using natural language and mathematical formulations, such as in Section 3 for monolingual embedding models and throughout the descriptions of cross-lingual models. |
| Open Source Code | No | Available libraries for evaluation on bilingual lexicon induction are the MUSE (Conneau et al., 2018a)19 and VecMap (Artetxe et al., 2018a)20 projects. The recent xling-eval benchmark by Glavaš et al. (2019)21 includes bilingual lexicon induction as well as three downstream tasks: cross-lingual document classification, natural language inference, and information retrieval. Footnote 19: https://github.com/facebookresearch/MUSE. Footnote 20: https://github.com/artetxem/vecmap. Footnote 21: https://github.com/codogogo/xling-eval. These are third-party tools/benchmarks for evaluation, not code for a novel methodology described in this survey paper itself. |
| Open Datasets | Yes | Word similarity: WordSim-353 has been translated to Spanish, Romanian, and Arabic (Hassan & Mihalcea, 2009) and to German, Italian, and Russian (Leviant & Reichart, 2015); RG was translated to German (Gurevych, 2005), French (Joubarne & Inkpen, 2011), and Spanish and Farsi (Camacho-Collados, Pilehvar, & Navigli, 2015); and SimLex-999 was translated to German, Italian, and Russian (Leviant & Reichart, 2015) and to Hebrew and Croatian (Mrkšić et al., 2017b). The SemEval 2017 task on cross-lingual and multilingual word similarity (Camacho-Collados, Pilehvar, Collier, & Navigli, 2017) introduced cross-lingual word similarity datasets. Bilingual dictionary induction: Upadhyay et al. (2016) obtain evaluation sets for the task across 26 languages from the Open Multilingual WordNet (Bond & Foster, 2013), while Levy et al. (2017) obtain bilingual dictionaries from Wiktionary. Document classification: RCV2 Reuters multilingual corpus. Dependency parsing: Universal Dependencies data (McDonald et al., 2013). POS tagging: Universal Dependencies treebanks (Nivre et al., 2016a) and the CoNLL-X datasets of European languages (Buchholz & Marsi, 2006). Named entity recognition (NER): OntoNotes (Hovy et al., 2006), CoNLL 2003 (Tjong Kim Sang & De Meulder, 2003), and Spanish and Dutch data from CoNLL 2002 (Tjong Kim Sang, 2002). Dialog state tracking (DST): Multilingual WOZ 2.0 dataset (Wen et al., 2017). Sentiment analysis: the multilingual Amazon product review dataset of Prettenhofer and Stein (2010). Natural language inference: XNLI (Conneau et al., 2018b). |
| Dataset Splits | No | The paper is a survey and reports no new experiments of its own, so it defines no train/validation/test splits. It refers to datasets and tasks from prior work, occasionally noting that those papers use standard splits, but it specifies no splits itself. |
| Hardware Specification | No | The paper is a survey and does not describe any specific hardware used for computational experiments by its authors. |
| Software Dependencies | No | The paper mentions several software tools and libraries in the context of models and evaluation (e.g., fast_align, MUSE, VecMap, xling-eval) but does not provide version numbers for any software dependencies used in conducting the survey or the analysis presented in the paper. |
| Experiment Setup | No | The paper is a survey and focuses on reviewing existing models and their evaluation. It does not present a novel model or conduct new experiments that would require detailing a specific experimental setup or hyperparameters for its own work. |
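For context on the bilingual lexicon induction evaluation referenced in the Open Source Code row, tools such as MUSE and VecMap score a cross-lingual embedding space by retrieving, for each source word, its nearest target-language neighbour and checking it against a gold dictionary (precision@1). The following is a minimal sketch with invented toy vectors, not the actual implementation of either tool:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def precision_at_1(src_vecs, tgt_vecs, dictionary):
    """Toy bilingual lexicon induction: for each source word in the gold
    dictionary, retrieve the most cosine-similar target word and count a
    hit when it matches the gold translation."""
    hits = 0
    for src_word, gold_translation in dictionary.items():
        best = max(tgt_vecs, key=lambda w: cosine(src_vecs[src_word], tgt_vecs[w]))
        hits += best == gold_translation
    return hits / len(dictionary)

# Invented 3-d embeddings; a real evaluation uses vectors from an
# aligned cross-lingual space and dictionaries of thousands of pairs.
src = {"dog": [1.0, 0.0, 0.1], "cat": [0.0, 1.0, 0.1]}
tgt = {"perro": [0.9, 0.1, 0.0], "gato": [0.1, 0.95, 0.0], "casa": [0.0, 0.0, 1.0]}
gold = {"dog": "perro", "cat": "gato"}
print(precision_at_1(src, tgt, gold))  # 1.0
```

Real benchmarks refine this retrieval step (e.g., MUSE's CSLS criterion corrects for hubness), but the reported metric is the same precision@k over a held-out dictionary.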