Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Automatic Language Identification in Texts: A Survey

Authors: Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, Krister Lindén

JAIR 2019

Each entry below gives the reproducibility variable, its classified result, and the supporting LLM response.
Research Type: Theoretical
This article provides a brief history of LI research and an extensive survey of the features and methods used in the LI literature. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.

Researcher Affiliation: Academia
Tommi Jauhiainen, Department of Digital Humanities, The University of Helsinki; Marco Lui, School of Computing and Information Systems, The University of Melbourne; Marcos Zampieri, College of Liberal Arts, Rochester Institute of Technology; Timothy Baldwin, School of Computing and Information Systems, The University of Melbourne; Krister Lindén, Department of Digital Humanities, The University of Helsinki.

Pseudocode: No
The paper uses mathematical formulas to describe methods (e.g., Equations 1, 2, 15, and 16) but does not present any structured pseudocode or algorithm blocks written by the authors of this survey. It refers only to pseudocode in another author's work: "Dongen (2017) presents the pseudo code for her dictionary lookup tool."

Open Source Code: No
The paper is a survey and discusses various off-the-shelf and open-source language identifiers developed by other researchers, such as Text Cat, libtextcat, whatlang-rs, Chrome CLD, Lang Detect, langid.py, whatlang, and YALI. However, the authors of this survey do not state that they are releasing code of their own for the work described in this paper.

Open Datasets: Yes
The paper provides Table 11 ("Published LI Datasets"), which includes specific URLs for each dataset, such as "Baldwin and Lui (2010a), Multilingual (81), Government Documents, News Texts, Wikipedia: https://github.com/varh1i/language_detection/tree/master/src/main/resources/naacl2010-langid" and "Lui and Baldwin (2011), Multilingual, Various: http://people.eng.unimelb.edu.au/tbaldwin/etc/ijcnlp2011-langid.tgz".

Dataset Splits: No
The paper discusses evaluation practices in LI research in general, mentioning "training and test data, or partitions for cross-validation" in Section 7.1. However, it does not specify any particular dataset splits used by the authors themselves for experiments within this survey.

Hardware Specification: No
The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used by the authors to conduct their research. It is a survey paper reviewing existing literature.

Software Dependencies: No
The paper discusses software tools and implementations from other research (e.g., Text Cat, langid.py, the SRILM toolkit, WEKA, scikit-learn) but does not list ancillary software with version numbers used by the authors for their own work presented in this survey.

Experiment Setup: No
As a survey paper, the document does not describe experimental setup details, hyperparameter values, or training configurations for research of the authors' own. It synthesizes findings and methods from the existing literature.
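The notice at the top of this page states that the LLM-based classifications were validated against a manually labeled dataset. A minimal sketch of how such per-variable agreement could be computed is shown below; the function name, key structure, and label values are illustrative assumptions, not the actual methodology of [1].

```python
# Hypothetical sketch: per-variable agreement between LLM-assigned
# reproducibility labels and a manually labeled validation set.
# Labels are keyed by (paper_id, variable); values are strings like "Yes"/"No".
from collections import defaultdict


def per_variable_accuracy(llm_labels, manual_labels):
    """Return {variable: fraction of papers where the LLM label matches manual}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for (paper_id, variable), gold in manual_labels.items():
        total[variable] += 1
        # A missing LLM label counts as a mismatch.
        if llm_labels.get((paper_id, variable)) == gold:
            correct[variable] += 1
    return {v: correct[v] / total[v] for v in total}


# Illustrative toy data (not real validation results).
manual = {
    ("p1", "Open Source Code"): "No",
    ("p2", "Open Source Code"): "Yes",
    ("p1", "Open Datasets"): "Yes",
}
llm = {
    ("p1", "Open Source Code"): "No",
    ("p2", "Open Source Code"): "No",
    ("p1", "Open Datasets"): "Yes",
}

print(per_variable_accuracy(llm, manual))
# → {'Open Source Code': 0.5, 'Open Datasets': 1.0}
```

Keeping accuracy per variable, rather than pooled, matters here because some variables (e.g., Open Source Code) tend to be easier to classify than others (e.g., Dataset Splits), so a single aggregate figure can hide uneven reliability.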