Human-Guided Fair Classification for Natural Language Processing

Authors: Florian E. Dorner, Momchil Peychev, Nikola Konstantinov, Naman Goel, Elliott Ash, Martin Vechev

ICLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental Our experimental results, based on a large dataset for online content moderation, show that in this context our pipeline effectively generates a set of candidate pairs that covers more diverse perturbations than existing word replacement based approaches and successfully leverages human feedback to verify and filter these candidate pairs.
Researcher Affiliation Academia 1 ETH Zurich, 2 MPI for Intelligent Systems, Tübingen, 3 University of Oxford. Correspondence to: florian.dorner@tuebingen.mpg.de
Pseudocode No The paper describes methods in text but does not include any pseudocode or algorithm blocks.
Open Source Code Yes We provide code to reproduce our generation pipeline and our experiments on synthetic data, as well as our dataset of human fairness judgments, at https://github.com/eth-sri/fairness-feedback-nlp
Open Datasets Yes We focus on toxicity classification on the Jigsaw Civil Comments dataset (https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data). The dataset contains around 2 million online comments s, as well as labels toxic(s) indicating the fraction of human labelers that considered comment s toxic. We define binary classification labels y(s) := toxic(s) > 0.5. A subset D of the Civil Comments dataset also contains labels A_j(s) that indicate the fraction of human labelers who think comment s mentions the demographic group j. We again define binary classification labels as y_j(s) := A_j(s) > 0.5 for these comments, and use them to train our group-presence classifier c. We only consider the subset D′ ⊆ D for which no NaN values are contained in the dataset and the RoBERTa-tokenized version of s does not exceed a length of 64 tokens. We furthermore split D′ into a training set containing 75% of D′ and a test set containing the other 25%. To build the pool Ce of candidate pairs for word replacement and style transfer, we attempt to produce modified comments s′_j′ mentioning group j′ for each s ∈ D′, for all demographic groups j with y_j(s) = 1 and all possible target groups j′. For GPT-3, we use a subset of D′ due to limited resources. We then combine 42,500 randomly selected pairs (s, s′) with s in the training part of D′ for word replacement and style transfer each, and a total of 15,000 pairs (s, s′) for our three GPT-3 approaches, to form the set of candidate constraints Ce. We similarly construct a set of test constraints of a fourth of Ce's size from the test portion of D′. More technical details can be found in App. B.
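The label binarization and 75/25 split described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the dict keys ('toxic', 'groups') are placeholders, and the real Civil Comments columns and the paper's filtering (NaN removal, token-length cap) are omitted.

```python
import random

def binarize_and_split(comments, train_frac=0.75, seed=0):
    """Binarize fractional labels and split into train/test sets.

    `comments`: list of dicts with hypothetical keys 'text',
    'toxic' (fraction of labelers voting toxic) and 'groups'
    (per-group annotator fractions A_j(s)).
    """
    prepared = []
    for c in comments:
        prepared.append({
            "text": c["text"],
            # y(s) := toxic(s) > 0.5
            "y": c["toxic"] > 0.5,
            # y_j(s) := A_j(s) > 0.5 for each demographic group j
            "y_groups": {j: a > 0.5 for j, a in c["groups"].items()},
        })
    rng = random.Random(seed)
    rng.shuffle(prepared)
    cut = int(train_frac * len(prepared))
    return prepared[:cut], prepared[cut:]
```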
Dataset Splits No We furthermore split D into a training set containing 75% of D and a test set containing the other 25%. The paper explicitly defines training and test splits but does not mention a validation split percentage or specific partitioning for a validation set from the main dataset.
Hardware Specification No The paper mentions models like RoBERTa and BART, but does not specify the hardware (e.g., GPU/CPU models) used for training or inference.
Software Dependencies Yes All of our experiments involving transformer language models use the huggingface transformers library Wolf et al. (2020). We accessed GPT-3 using OpenAI's API. For our first approach, we used the "text-davinci-001" version of GPT-3 in a zero-shot manner... The second approach was based on the beta version of GPT-3's editing mode. Here, s′ is produced using the model "text-davinci-edit-001"...
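Combining the model names above with the sampling settings reported in the experiment-setup row, a request to the (now-legacy) OpenAI completions API would look roughly like the sketch below. The prompt wording is a placeholder and not taken from the paper; only the model name and the sampling parameters come from the quoted text.

```python
def davinci_request(comment, target_group):
    """Build keyword arguments for the legacy OpenAI completions
    endpoint (openai.Completion.create). The prompt template here
    is hypothetical; the paper's actual prompts may differ."""
    return {
        "model": "text-davinci-001",
        "prompt": f"Rewrite the comment so it mentions {target_group} "
                  f"instead:\n{comment}",
        "temperature": 0.7,   # sampling settings from the paper
        "top_p": 1,
        "max_tokens": 64,     # caps the length of the modified comment
    }
```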
Experiment Setup Yes We train c for 3 epochs with a batch size of 16 and use the Adam optimizer Kingma & Ba (2015) with learning rate 0.00001 to optimize the binary cross-entropy loss, reweighted by relative label frequency in the dataset. The BART-based generator g is trained starting from the pretrained facebook/bart-large model for a single epoch with batch size 4, again using Adam and a learning rate of 0.00001. We used temperature = 0.7 and top_p = 1 in all our approaches, and used max_tokens = 64 for "text-davinci-001" to control the length of the modified sentence s′.
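One plausible reading of "reweighted by relative label frequency" is inverse-frequency class weights applied to the binary cross-entropy loss. The sketch below shows this scheme in plain Python; the paper does not spell out its exact weighting formula, so treat this as an assumption.

```python
import math

def frequency_weights(labels):
    """Per-class weights inversely proportional to relative label
    frequency: a class seen in half the data gets weight 1."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}

def weighted_bce(p, y, weights):
    """Weighted binary cross-entropy for one prediction p in (0, 1)
    against a binary label y, using the class weights above."""
    if y == 1:
        return -weights[1] * math.log(p)
    return -weights[0] * math.log(1.0 - p)
```

In a PyTorch training loop the same effect is usually obtained via the `pos_weight` / `weight` arguments of the built-in BCE losses rather than a hand-rolled function.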