FuzzE: Fuzzy Fairness Evaluation of Offensive Language Classifiers on African-American English
Authors: Anthony Rios
AAAI 2020, pp. 881-889
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To measure these problems, we need text written in both AAVE and Standard American English (SAE). Specifically, we propose an automated fairness fuzzing tool called FuzzE to quantify the fairness of text classifiers applied to AAVE text using a dataset that only contains text written in SAE. Overall, we find that the fairness estimates returned by our technique moderately correlate with the use of real ground-truth AAVE text. We conduct a detailed analysis of the framework using automatic style transfer evaluation metrics. Moreover, we measure the increase in well-known phonetic and syntactic AAVE constructions produced by different style transfer techniques after being applied to SAE text. We also perform a human evaluation study to measure semantic change (e.g., offensive to not-offensive) encountered by transforming the style of text. (A hypothetical sketch of this fuzzing loop appears after the table.) |
| Researcher Affiliation | Academia | Anthony Rios, Department of Information Systems and Cyber Security, University of Texas at San Antonio, anthony.rios@utsa.edu |
| Pseudocode | No | The paper describes the methods and workflow in prose and with diagrams (Figure 1), but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology (Fuzz E, style transfer methods) is publicly available. |
| Open Datasets | Yes | AAVE Dataset (Style Data): Blodgett, Green, and O'Connor (2016) originally collected and released more than 59.2 million tweets by 2.8 million users. Offensive Language Datasets: We investigate style transfer and fairness evaluation using two datasets: the Offensive Language Identification Dataset (OLID) (Zampieri et al. 2019) and the Hate Speech and Offensive Language (HSOL) dataset (Davidson et al. 2017). |
| Dataset Splits | No | The SAE tweets in both datasets are split into a training set (80%) and a test set (20%). While the paper mentions bootstrap sampling from the training split to create multiple models, it does not specify a distinct validation set or its proportion. (A split-and-bootstrap sketch appears after this table.) |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions various models and tools used (e.g., Logistic Regression, CNN, Bi-LSTM, Twokenizer, KenLM) and cites the relevant papers, but it does not provide version numbers for any software dependencies such as the programming language, libraries, or frameworks (e.g., Python, PyTorch, scikit-learn). |
| Experiment Setup | Yes | Using cross-validation, the regularization parameter is optimized for each dataset independently. We found the best regularization parameters for OLID and HSOL to be 0.1 and 1.0, respectively. For the model specification of the generator and encoder, we use a two-layer Bi-LSTM with a word embedding size of 300 and hidden dimension size of 500. The generator produces sequences of at most 50 tokens. The CNN classifier is trained with 100 filters that span 5 words. (A hyperparameter sketch appears after this table.) |
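
The abstract describes FuzzE's core idea: generate AAVE-style variants of SAE text and check whether the classifier's behavior changes. Since the paper includes no pseudocode (see the Pseudocode row), the following is only a minimal sketch of one plausible fuzzing loop; the names `style_transfer` and `classifier`, and the label-flip rate used as the metric, are assumptions rather than the paper's actual fairness measure.

```python
def fairness_fuzz(classifier, style_transfer, sae_texts):
    """Hypothetical fuzzing loop: compare predictions on SAE text and on
    its AAVE-style transformation (names and metric are assumptions)."""
    flips = 0
    for text in sae_texts:
        aave_like = style_transfer(text)  # SAE -> AAVE-style variant
        if classifier(text) != classifier(aave_like):
            flips += 1  # prediction changed by style alone
    return flips / len(sae_texts)  # fraction of style-induced label flips
```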
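
The Dataset Splits row reports an 80/20 train/test split with bootstrap sampling from the training portion to fit multiple models. Below is a minimal sketch of that setup using scikit-learn and NumPy; the placeholder data, random seed, and ensemble size are illustrative, since the paper does not specify them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder corpus; in practice these would be the SAE tweets and labels
# from OLID or HSOL.
tweets = [f"tweet {i}" for i in range(1000)]
labels = np.random.randint(0, 2, size=1000)

# 80/20 train/test split as reported; the random seed is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, random_state=42
)

# Bootstrap resampling from the training split to train multiple models;
# the paper does not state the ensemble size (10 here is a guess).
rng = np.random.default_rng(42)
for _ in range(10):
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    X_boot = [X_train[i] for i in idx]
    y_boot = y_train[idx]
    # ... fit one classifier on (X_boot, y_boot) here
```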
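
The Experiment Setup row gives concrete hyperparameters: a cross-validated search over the logistic regression regularization strength (best values 0.1 on OLID and 1.0 on HSOL), a two-layer Bi-LSTM with 300-dimensional embeddings and a 500-dimensional hidden state, a 50-token maximum sequence length, and a CNN classifier with 100 filters spanning 5 words. The sketch below wires those reported numbers into scikit-learn and PyTorch layer definitions; the TF-IDF features, search grid, vocabulary size, and everything else not stated in the paper (dropout, batching, training loop) are assumptions.

```python
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Cross-validated search over the regularization strength C; the paper
# reports best values of 0.1 (OLID) and 1.0 (HSOL). The TF-IDF features
# and the grid itself are assumptions.
lr_search = GridSearchCV(
    Pipeline([("tfidf", TfidfVectorizer()),
              ("clf", LogisticRegression(max_iter=1000))]),
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)

VOCAB_SIZE = 30_000                            # hypothetical vocabulary size
EMBED_DIM, HIDDEN_DIM, MAX_LEN = 300, 500, 50  # reported in the paper

# Two-layer Bi-LSTM encoder matching the reported dimensions.
embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
encoder = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=2,
                  bidirectional=True, batch_first=True)

# CNN classifier with 100 filters spanning 5 words, as reported.
conv = nn.Conv1d(in_channels=EMBED_DIM, out_channels=100, kernel_size=5)
```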