Cross-Lingual Dataless Classification for Many Languages
Authors: Yangqiu Song, Shyam Upadhyay, Haoruo Peng, Dan Roth
IJCAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate our approach by experimenting with classifying documents in 88 different languages into the same English label space. In particular, we show that CLESA is better than using a monolingual ESA on the target foreign language and translating the English labels into that language. Moreover, the evaluation on two benchmarks, TED and RCV2, showed that cross-lingual dataless classification outperforms supervised learning methods when a large collection of annotated documents is not available. *(A minimal sketch of the label-scoring step appears after the table.)* |
| Researcher Affiliation | Academia | Yangqiu Song (Lane Department of CSEE, West Virginia University; yangqiu.song@mail.wvu.edu); Shyam Upadhyay, Haoruo Peng, and Dan Roth (Department of Computer Science, University of Illinois at Urbana-Champaign; {upadhya3,hpeng7,danr}@illinois.edu) |
| Pseudocode | No | The paper describes procedural steps in paragraph form but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link to the source code for the methodology described. |
| Open Datasets | Yes | We first generate a multi-lingual classification data set across 88 languages by selecting 100 documents from the 20-newsgroups data set [Lang, 1995] and translating them using Google translation into 88 languages. We also use two standard benchmark data sets, TED [Hermann and Blunsom, 2014] and RCV2 (a multi-lingual version of RCV1 [Lewis et al., 2004] for English), to evaluate the cross-lingual dataless classification. |
| Dataset Splits | Yes | Thus, we randomly split the provided training set into 70% training and 30% validation sets. Then we use the training set to train a model and use the validation set to tune the threshold. We average the results over ten trials to select the best threshold. *(See the split-and-tune sketch after the table.)* |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or specific machine types) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Lucene language-dependent tokenizers' and 'Liblinear [Fan et al., 2008]' but does not provide specific version numbers for these or any other software dependencies crucial for replication. |
| Experiment Setup | Yes | We trained the CVM model using the parallel corpora with the default setting as well as the settings indicated in the paper [Hermann and Blunsom, 2014] using their software. The length of the word vector was set to 128, the number of iterations was set to five, and the number of mini-batches was set to ten. *(The reported hyperparameters are restated as a config sketch after the table.)* |
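The CLESA decision rule quoted in the Research Type row reduces to a nearest-label search: the foreign-language document and the English label descriptions are both mapped into a shared Wikipedia concept space, and the document takes the label whose vector it is closest to under cosine similarity. A minimal sketch, assuming the CLESA projection has already been computed; `doc_vec`, `label_vecs`, and the function names here are illustrative, not taken from the paper:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two concept-space vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def dataless_classify(doc_vec, label_vecs):
    """Assign the English label whose ESA vector is closest to the
    document's CLESA vector (both live in the shared concept space)."""
    return max(label_vecs, key=lambda name: cosine(doc_vec, label_vecs[name]))
```

This is why the method needs no annotated target-language documents: the only supervision is the text of the English label descriptions themselves.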
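The threshold-tuning protocol quoted in the Dataset Splits row can be pinned down the same way. A hedged sketch of the 70/30 split averaged over ten trials; `train_model` and `score_with_threshold` are placeholders (the paper's classifier is Liblinear, and its exact validation metric is not restated here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def tune_threshold(X, y, candidate_thresholds, train_model,
                   score_with_threshold, n_trials=10, seed=0):
    """Split the provided training set 70/30 into train/validation,
    score each candidate threshold on the validation split, and pick
    the threshold with the best score averaged over ten random trials."""
    scores = np.zeros((n_trials, len(candidate_thresholds)))
    for t in range(n_trials):
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.3, random_state=seed + t)
        model = train_model(X_tr, y_tr)
        for i, thr in enumerate(candidate_thresholds):
            scores[t, i] = score_with_threshold(model, X_val, y_val, thr)
    return candidate_thresholds[int(scores.mean(axis=0).argmax())]
```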
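Finally, the Experiment Setup row fixes three CVM hyperparameters. Restated as a config sketch for reference; the key names are illustrative and do not correspond to actual flags of Hermann and Blunsom's software:

```python
# CVM (compositional vector model) settings reported in the paper.
# Key names are hypothetical; values are as stated in the evidence above.
cvm_config = {
    "embedding_dim": 128,  # "length of the word vector was set to 128"
    "iterations": 5,       # "number of iterations was set to five"
    "num_batches": 10,     # "number of mini-batches was set to ten"
}
```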