Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Automatic Language Identification in Texts: A Survey

Authors: Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, Krister Lindén

JAIR 2019

Each entry below gives the reproducibility variable, its classified result, and the supporting LLM response.
Research Type: Theoretical
This article provides a brief history of LI research and an extensive survey of the features and methods used in the LI literature. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.

Researcher Affiliation: Academia
Tommi Jauhiainen, Department of Digital Humanities, The University of Helsinki; Marco Lui, School of Computing and Information Systems, The University of Melbourne; Marcos Zampieri, College of Liberal Arts, Rochester Institute of Technology; Timothy Baldwin, School of Computing and Information Systems, The University of Melbourne; Krister Lindén, Department of Digital Humanities, The University of Helsinki.

Pseudocode: No
The paper uses mathematical formulas to describe methods (e.g., Equations 1, 2, 15, and 16) but does not present any structured pseudocode or algorithm blocks written by the authors of this survey. It refers only to pseudocode in another author's work: "Dongen (2017) presents the pseudo code for her dictionary lookup tool."

Open Source Code: No
The paper is a survey and discusses various off-the-shelf and open-source language identifiers developed by other researchers, such as Text Cat, libtextcat, whatlang-rs, Chrome CLD, Lang Detect, langid.py, whatlang, and YALI. However, the authors of this survey do not state that they are releasing code of their own for the work described in this paper.

Open Datasets: Yes
The paper provides Table 11 ("Published LI Datasets"), which includes specific URLs for each dataset, such as "Baldwin and Lui (2010a), Multilingual (81), Government Documents, News Texts, Wikipedia: https://github.com/varh1i/language_detection/tree/master/src/main/resources/naacl2010-langid" and "Lui and Baldwin (2011), Multilingual, Various: http://people.eng.unimelb.edu.au/tbaldwin/etc/ijcnlp2011-langid.tgz".

Dataset Splits: No
The paper discusses evaluation practices in LI research in general, mentioning "training and test data, or partitions for cross-validation" in Section 7.1. However, it does not specify any particular dataset splits used by the authors themselves for experiments within this survey.

Hardware Specification: No
The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used by the authors to conduct their research. It is a survey paper reviewing existing literature.

Software Dependencies: No
The paper discusses software tools and implementations from other research (e.g., Text Cat, langid.py, the SRILM toolkit, WEKA, scikit-learn) but does not list ancillary software with version numbers used by the authors for their own work presented in this survey.

Experiment Setup: No
As a survey paper, the document does not describe experimental setup details, hyperparameter values, or training configurations for research of the authors' own. It synthesizes findings and methods from the existing literature.
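The notice at the top of this page states that the LLM-based classifications were validated against a manually labeled dataset. A minimal sketch of how such per-variable agreement could be computed is shown below; the function name, key structure, and label values are illustrative assumptions, not the actual methodology of [1].

```python
# Hypothetical sketch: per-variable agreement between LLM-assigned
# reproducibility labels and a manually labeled validation set.
# Labels are keyed by (paper_id, variable); values are strings like "Yes"/"No".
from collections import defaultdict


def per_variable_accuracy(llm_labels, manual_labels):
    """Return {variable: fraction of papers where the LLM label matches manual}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for (paper_id, variable), gold in manual_labels.items():
        total[variable] += 1
        # A missing LLM label counts as a mismatch.
        if llm_labels.get((paper_id, variable)) == gold:
            correct[variable] += 1
    return {v: correct[v] / total[v] for v in total}


# Illustrative toy data (not real validation results).
manual = {
    ("p1", "Open Source Code"): "No",
    ("p2", "Open Source Code"): "Yes",
    ("p1", "Open Datasets"): "Yes",
}
llm = {
    ("p1", "Open Source Code"): "No",
    ("p2", "Open Source Code"): "No",
    ("p1", "Open Datasets"): "Yes",
}

print(per_variable_accuracy(llm, manual))
# → {'Open Source Code': 0.5, 'Open Datasets': 1.0}
```

Keeping accuracy per variable, rather than pooled, matters here because some variables (e.g., Open Source Code) tend to be easier to classify than others (e.g., Dataset Splits), so a single aggregate figure can hide uneven reliability.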