Cross-Lingual Dataless Classification for Many Languages

Authors: Yangqiu Song, Shyam Upadhyay, Haoruo Peng, Dan Roth

IJCAI 2016

Reproducibility assessment (each entry gives the variable, the assessed result, and the supporting LLM response):
Research Type: Experimental. "We illustrate our approach by experimenting with classifying documents in 88 different languages into the same English label space. In particular, we show that CLESA is better than using a monolingual ESA on the target foreign language and translating the English labels into that language. Moreover, the evaluation on two benchmarks, TED and RCV2, showed that cross-lingual dataless classification outperforms supervised learning methods when a large collection of annotated documents is not available."
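
The CLESA approach quoted above can be summarized in a short sketch: represent both the foreign-language document and the English label descriptions in a shared Wikipedia concept space, align the two language editions via inter-language links, and assign the label with the highest cosine similarity. The code below is a minimal illustration, not the authors' implementation; esa_vector, clesa_classify, and concept_align are hypothetical names, and the alignment matrix stands in for the inter-language link mapping that CLESA relies on.

    import numpy as np

    def esa_vector(text_tfidf, concept_matrix):
        # ESA: project a TF-IDF term vector onto the Wikipedia concept
        # space; each column of concept_matrix is the TF-IDF vector of
        # one Wikipedia article.
        return text_tfidf @ concept_matrix

    def clesa_classify(doc_concepts_target, label_concepts_en, concept_align):
        # CLESA sketch: map the target-language concept vector into the
        # English concept space (concept_align is a hypothetical 0/1
        # alignment matrix built from inter-language Wikipedia links),
        # then return the index of the most similar English label.
        doc_en = doc_concepts_target @ concept_align
        sims = label_concepts_en @ doc_en
        norms = (np.linalg.norm(label_concepts_en, axis=1)
                 * np.linalg.norm(doc_en) + 1e-12)
        return int(np.argmax(sims / norms))

No labeled documents in the target language are needed: only the label descriptions (in English) and the two Wikipedia editions, which is the "dataless" aspect of the method.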
Researcher Affiliation: Academia. "Yangqiu Song (1), Shyam Upadhyay (2), Haoruo Peng (2), and Dan Roth (2). (1) Lane Department of CSEE, West Virginia University; (2) Department of Computer Science, University of Illinois at Urbana-Champaign. yangqiu.song@mail.wvu.edu, {upadhya3,hpeng7,danr}@illinois.edu"
Pseudocode: No. The paper describes procedural steps in paragraph form but does not include any structured pseudocode or algorithm blocks.
Open Source Code: No. The paper does not provide an explicit statement of, or link to, source code for the described methodology.
Open Datasets: Yes. "We first generate a multi-lingual classification data set across 88 languages by selecting 100 documents from the 20-newsgroups data set [Lang, 1995] and translating them using Google translation into 88 languages. We also use two standard benchmark data sets, TED [Hermann and Blunsom, 2014] and RCV2 (a multi-lingual version of RCV1 [Lewis et al., 2004] for English), to evaluate the cross-lingual dataless classification."
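
The quoted dataset construction is straightforward to reproduce in outline. The sketch below samples 100 documents from 20-newsgroups via scikit-learn; the translate callable is a hypothetical stand-in for the Google translation step, which the paper does not specify in code.

    import random
    from sklearn.datasets import fetch_20newsgroups

    def build_multilingual_set(languages, translate, n_docs=100, seed=0):
        # Sample 100 documents from 20-newsgroups and translate each into
        # every target language; `translate` is a hypothetical hook for
        # the Google translation call described in the paper.
        data = fetch_20newsgroups(subset="all",
                                  remove=("headers", "footers", "quotes"))
        rng = random.Random(seed)
        idx = rng.sample(range(len(data.data)), n_docs)
        return {
            lang: [(translate(data.data[i], target=lang), int(data.target[i]))
                   for i in idx]
            for lang in languages
        }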
Dataset Splits: Yes. "Thus, we randomly split the provided training set into 70% training and 30% validation sets. Then we use the training set to train a model and use the validation set to tune the threshold. We average the results over ten trials to select the best threshold."
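
The quoted split-and-tune procedure maps directly onto a short routine: ten random 70/30 splits, scoring each candidate threshold on the validation portion and keeping the one with the best average. In the sketch below, train_fn and score_fn are hypothetical hooks, since the paper does not name its classifier interface.

    import numpy as np
    from sklearn.model_selection import train_test_split

    def tune_threshold(X, y, train_fn, score_fn, thresholds,
                       trials=10, seed=0):
        # Accumulate validation scores per threshold over ten random
        # 70/30 splits, then return the threshold with the best average.
        scores = np.zeros(len(thresholds))
        for t in range(trials):
            X_tr, X_va, y_tr, y_va = train_test_split(
                X, y, test_size=0.3, random_state=seed + t)
            model = train_fn(X_tr, y_tr)
            for i, th in enumerate(thresholds):
                scores[i] += score_fn(model, X_va, y_va, th)
        return thresholds[int(np.argmax(scores / trials))]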
Hardware Specification: No. The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or machine types) used for running the experiments.
Software Dependencies: No. The paper mentions "Lucene language-dependent tokenizers" and "Liblinear [Fan et al., 2008]" but does not provide version numbers for these or any other software dependencies crucial for replication.
Experiment Setup: Yes. "We trained the CVM model using the parallel corpora with the default setting as well as the settings indicated in the paper [Hermann and Blunsom, 2014] using their software. The length of the word vector was set to 128, the number of iterations was set to five, and the number of mini-batches was set to ten."
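
For reference, the reported CVM hyperparameters can be collected into a small configuration stub. The key names below are illustrative only and are not taken from the Hermann and Blunsom software.

    # Hyperparameters reported for the CVM baseline
    # [Hermann and Blunsom, 2014]; key names are illustrative.
    cvm_config = {
        "embedding_dim": 128,  # length of the word vector
        "iterations": 5,       # number of training iterations
        "minibatches": 10,     # number of mini-batches
    }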