Hindi-English Hate Speech Detection: Author Profiling, Debiasing, and Practical Perspectives
Authors: Shivang Chopra, Ramit Sawhney, Puneet Mathur, Rajiv Ratn Shah
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive comparison against several baselines on two real-world datasets, we demonstrate how targeted hate embeddings combined with social network-based features outperform state of the art, both quantitatively and qualitatively. Additionally, we present an expert-in-the-loop algorithm for bias elimination in the proposed model pipeline and study the prevalence and performance impact of the debiasing. Finally, we discuss the computational, practical, ethical, and reproducibility aspects of the deployment of our pipeline across the Web. |
| Researcher Affiliation | Academia | Shivang Chopra (Delhi Technological University, Delhi), Ramit Sawhney (Netaji Subhas Institute of Technology, Delhi), Puneet Mathur (University of Maryland College Park), Rajiv Ratn Shah (IIIT Delhi, Delhi) |
| Pseudocode | Yes | Inspired by (Swinger et al. 2018), we propose a Bias Elimination (BE) algorithm to mitigate the effects of such bias and describe it below. Clustering Words: Make k disjoint clusters of words C1, C2, ..., Ck, with corresponding centroids c1, c2, ..., ck, using the k-Means algorithm. Two-Centroid Sub-Clustering: For each cluster, a hyperparameter λ is used to find the set of words closest to the centroid to be de-biased, using a two-centroid sub-clustering algorithm. [...] (A hedged sketch of these two steps follows the table.) |
| Open Source Code | No | The paper states 'We commit to releasing our annotated pairs list to the community,' but does not make a clear statement about the availability of the source code for the methodology described in the paper. |
| Open Datasets | Yes | To validate the proposed hypothesis, we use two datasets, HS (Bohra et al. 2018) and HEOT (Mathur et al. 2018b). |
| Dataset Splits | No | The paper mentions 'Early Stopping used with a patience of 0.05 on the validation accuracy,' implying a validation set was used, but it states only a 'train-test split of 80:20' and never gives the ratio or size of the validation split. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU model, CPU type) used to run the experiments. |
| Software Dependencies | No | The paper mentions 'Keras Tokenizer' and 'NLTK' but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The models were trained for 20 epochs, with Early Stopping used with a patience of 0.05 on the validation accuracy. (See the callback sketch after the table.) |
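
The Pseudocode row above quotes only the first two steps of the BE algorithm. Below is a minimal sketch of those steps, assuming scikit-learn's `KMeans` for both the word clustering and the two-centroid sub-clustering; the reading of λ as a fractional distance cutoff, and all function names, are our assumptions, not the authors' released code.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_words(vectors, k):
    """Step 1: k disjoint clusters C1..Ck with centroids c1..ck via k-Means."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    return km.labels_, km.cluster_centers_

def debias_candidates(vectors, labels, k, lam):
    """Step 2 (one reading of the paper's description): split each cluster
    with a two-centroid (2-means) sub-clustering, then flag words whose
    distance to their sub-centroid falls within a lambda-fraction of the
    cluster's maximum distance as candidates for de-biasing."""
    flagged = []
    for ci in range(k):
        idx = np.where(labels == ci)[0]
        if len(idx) < 2:  # 2-means needs at least two points
            continue
        sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors[idx])
        dists = np.linalg.norm(
            vectors[idx] - sub.cluster_centers_[sub.labels_], axis=1
        )
        flagged.extend(idx[dists <= lam * dists.max()].tolist())
    return flagged
```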
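
For the Experiment Setup row, a 'patience of 0.05' is unusual: in Keras (the only framework the paper names) patience is an integer epoch count, so 0.05 most plausibly maps to the `min_delta` improvement threshold. The sketch below follows that assumption; the exact callback configuration is not stated in the paper.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Assumption: the quoted "patience of 0.05 on the validation accuracy" is
# read here as a min_delta threshold on val_accuracy; the paper does not
# spell out the callback settings.
early_stop = EarlyStopping(
    monitor="val_accuracy",  # watch validation accuracy
    min_delta=0.05,          # smallest improvement that counts
    restore_best_weights=True,
)

# model.fit(X_train, y_train, epochs=20,
#           validation_split=0.2,  # assumed; paper states only an 80:20 train-test split
#           callbacks=[early_stop])
```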