Metamorphic Testing and Certified Mitigation of Fairness Violations in NLP Models
Authors: Pingchuan Ma, Shuai Wang, Jin Liu
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our technique using popular (commercial) NLP models, and successfully flag thousands of discriminatory inputs that can cause fairness violations. Our extensive evaluation delineates the capabilities of de facto commercial NLP services provided by industry giants and also commonly-used local NLP models by exposing in total 2,874 discriminatory inputs (among which 441 are from commercial NLP services). We further enhance the evaluated (commercial) NLP models w.r.t. the certified guarantees at a modest cost. |
| Researcher Affiliation | Academia | Pingchuan Ma (1,2), Shuai Wang (1), and Jin Liu (2); (1) The Hong Kong University of Science and Technology; (2) Beijing Electronic Science and Technology Institute; pingchuan@ieee.org, shuaiw@cse.ust.hk, liujin@besti.edu.cn |
| Pseudocode | Yes | Algorithm 1: MT for Model Fairness Violation; Algorithm 2: Sentence Perturbator P; Algorithm 3: Analogy Mutation; Algorithm 4: Active Mutation |
| Open Source Code | No | The paper does not provide an explicit statement or a link to the open-source code for the methodology it describes. |
| Open Datasets | Yes | We present a motivating example by training a CNN model for SA with the Large Movie Review Dataset; We locally trained two SA models with the Large Movie Review training dataset [Maas et al., 2011]. (https://ai.stanford.edu/~amaas/data/sentiment/) |
| Dataset Splits | No | The paper mentions a "Large Movie Review test dataset" and "Large Movie Review training dataset" but does not specify the exact percentages or counts for training, validation, and test splits needed for reproducibility. It also does not reference predefined splits with citations for these specific datasets beyond their initial mention. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. It mentions CPU time and multi-core CPUs in the context of cost, but no specific models or specifications are provided. |
| Software Dependencies | Yes | The CNN model, implemented in Keras (ver. 2.2.4), has one convolutional layer followed by two fully-connected layers. As for the LR model, we use the default setting provided in Scikit-Learn (ver. 0.22.1) and BoW embedding. |
| Experiment Setup | No | The paper mentions architectural details for the CNN model (one convolutional layer, two fully-connected layers) and that the LR model uses default Scikit-Learn settings with BoW embedding. However, it does not specify concrete hyperparameter values (e.g., learning rate, batch size, number of epochs, specific optimizer settings) or other detailed training configurations. |
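To illustrate the level of detail the paper does provide, the LR baseline it describes (default Scikit-Learn `LogisticRegression` with a bag-of-words embedding) can be sketched as below. The tiny corpus and labels here are purely illustrative placeholders, not the Large Movie Review Dataset the paper uses, and the pipeline is an assumed reconstruction, not the authors' released code.

```python
# Minimal sketch of the described LR baseline: default Scikit-Learn
# LogisticRegression over a bag-of-words (CountVectorizer) embedding.
# The four toy reviews below stand in for the Large Movie Review data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "a truly great movie",
    "great acting and story",
    "a dull boring film",
    "boring and a waste of time",
]
labels = [1, 1, 0, 0]  # 1 = positive sentiment, 0 = negative

# BoW features feed directly into LR with default hyperparameters,
# matching the paper's statement that no custom settings were used.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

pred = model.predict(["great story"])[0]
print(pred)  # the toy model labels this review positive (1)
```

Because the paper reports only the library versions and this high-level pipeline, a faithful reproduction would still need the missing training details (epochs, optimizer, batch size for the CNN) noted in the Experiment Setup row.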