Harnessing Code Switching to Transcend the Linguistic Barrier

Authors: Ashiqur R. KhudaBukhsh, Shriphani Palakodety, Jaime G. Carbonell

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our data set, D, consists of 2.04 million comments ... Our results indicate that our approach considerably reduces manual effort in acquiring hope speech written mostly in Romanized Hindi. A further exploratory study on a new COVID-19 data set introduced in this paper demonstrates the generalizability of our cross-lingual sampling technique."
Researcher Affiliation | Collaboration | Ashiqur R. KhudaBukhsh (1), Shriphani Palakodety (2) and Jaime G. Carbonell (1); (1) School of Computer Science, Carnegie Mellon University; (2) Onai
Pseudocode | Yes | "Algorithm 1: NN-Sample(S, U)"
Open Source Code | No | "Resources and additional details are available at: https://www.cs.cmu.edu/~akhudabu/CodeSwitching2020.html." This statement is ambiguous and does not explicitly confirm the release of source code for the described methodology.
Open Datasets | No | "Our data set, D, consists of 2.04 million comments posted by 791,289 users on 2,890 YouTube videos relevant to this India-Pakistan conflict ... The hope speech classifier is trained on an annotated data set, D^{train}_{hope}, of 2,277 positive and 7,716 negative English comments." The paper mentions the data sets it uses but gives no concrete access information for them (e.g., a URL, a DOI, or a citation with author and year that establishes public availability).
Dataset Splits | No | The paper mentions training data and refers to "in-the-wild performance (on data not belonging to the training or test set)" for a classifier. However, it does not define a validation split, or otherwise specify the splits of its own experimental setup, in enough detail for reproduction.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU/CPU models, memory, or cloud resources.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers, such as programming languages, libraries, or specific tool versions.
Experiment Setup | Yes | "We used a well-known metric to measure the extent of code switching in a document: Code Mixing Index (CMI) ... When ϵ is set to 0.1, our method obtains the following top 20 ... Our sampling algorithm is described in Algorithm 1. This algorithm takes a seed set, S, and a sample pool U as inputs and outputs a set, E ⊂ U, containing nearest neighbors of S in the comment-embedding space ... The size parameter is set to 5 ... we used cosine distance of the embeddings as the distance measure."
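
The Experiment Setup row describes NN-Sample only in outline: a seed set S, a sample pool U, an output set E drawn from U, a size parameter of 5, and cosine distance over comment embeddings. The following is a minimal sketch of that outline, not the paper's Algorithm 1. It assumes that the size parameter is the number of nearest neighbors retrieved per seed, that embeddings are already available as non-zero numeric vectors, and that E is the union of the per-seed neighbors. The names `cosine_distance` and `nn_sample` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nn_sample(seed_vecs, pool_vecs, size=5):
    """Return sorted indices into pool_vecs of every pool item that is among
    the `size` nearest neighbors of at least one seed vector.

    Assumption: `size` is interpreted as neighbors per seed; the paper may
    define it differently.
    """
    expanded = set()
    for seed in seed_vecs:
        # Rank all pool items by cosine distance to this seed.
        ranked = sorted(
            range(len(pool_vecs)),
            key=lambda i: cosine_distance(seed, pool_vecs[i]),
        )
        expanded.update(ranked[:size])
    return sorted(expanded)


# Toy 2-D "embeddings": the first two pool items point roughly the same
# way as the seed, the last two do not.
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
seeds = [[1.0, 0.05]]
expanded = nn_sample(seeds, pool, size=2)
```

A brute-force ranking like this is quadratic in practice. For a pool of millions of comments, an approximate nearest-neighbor index would be the more realistic choice.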
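
The summary names the Code Mixing Index (CMI) without stating its formula. As a hedged illustration, the sketch below uses one commonly used utterance-level definition: CMI = 100 × (1 − max_i w_i / (n − u)) when n > u, and 0 otherwise. Here n is the total number of tokens, u is the number of language-independent tokens, and w_i is the number of tokens in language i. Whether the paper uses exactly this variant is an assumption, and the function name `code_mixing_index` and the `"other"` tag are hypothetical.

```python
from collections import Counter


def code_mixing_index(lang_tags, neutral_tag="other"):
    """Compute a CMI score from per-token language tags.

    Tokens tagged `neutral_tag` are treated as language independent
    (for example punctuation or named entities); this tagging scheme
    is an assumption made for the example.
    """
    n = len(lang_tags)
    counts = Counter(tag for tag in lang_tags if tag != neutral_tag)
    u = n - sum(counts.values())
    if n == u:
        # No language-specific tokens at all (also covers empty input).
        return 0.0
    dominant = max(counts.values())
    return 100.0 * (1.0 - dominant / (n - u))


# Two Hindi tokens, two English tokens, one neutral token.
mixed = code_mixing_index(["hi", "en", "hi", "en", "other"])
# A monolingual comment has no mixing.
mono = code_mixing_index(["en", "en", "en"])
```

Under this definition, a comment split evenly between two languages scores 50, and a monolingual comment scores 0.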