Are Models Biased on Text without Gender-related Language?

Authors: Catarina G Belém, Preethi Seshadri, Yasaman Razeghi, Sameer Singh

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (WinoBias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models.
Researcher Affiliation | Academia | Catarina Belem, Preethi Seshadri, Yasaman Razeghi, Sameer Singh, Department of Computer Science, University of California, Irvine. {cbelem,preethi,yrazeghi,sameer}@uci.edu
Pseudocode | No | The paper describes a pipeline for generating benchmarks but does not provide formal pseudocode or algorithm blocks.
Open Source Code | Yes | We release the full dataset and code at https://ucinlp.github.io/unstereo-eval.
Open Datasets | Yes | Word-gender correlations are determined empirically using word co-occurrence statistics from PILE, a high-quality and publicly available pretraining set used to train popular LMs (Gao et al., 2021). ... Additionally, we utilize UnStereoEval to repurpose two commonly used fairness benchmarks whose sentences are already gender-invariant: WinoBias (WB) and Winogender (WG) (Zhao et al., 2018; Rudinger et al., 2018).
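The word-gender correlation step can be illustrated with a simple co-occurrence counter: for each word, tally how often it appears in a sentence alongside a male versus a female pronoun. This is a hypothetical simplification for illustration only — the function name, pronoun sets, and toy corpus are assumptions, not the paper's exact formulation over PILE.

```python
from collections import Counter

# Assumed pronoun sets; the paper's actual gendered-word lists may differ.
MALE = {"he", "him", "his"}
FEMALE = {"she", "her", "hers"}

def gender_cooccurrence(sentences):
    """Count, per word, how many sentences it shares with a male
    pronoun vs. a female pronoun (a crude proxy for word-gender
    correlation from corpus co-occurrence statistics)."""
    male_counts, female_counts = Counter(), Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        has_male = any(t in MALE for t in tokens)
        has_female = any(t in FEMALE for t in tokens)
        for t in set(tokens):  # count each word once per sentence
            if has_male:
                male_counts[t] += 1
            if has_female:
                female_counts[t] += 1
    return male_counts, female_counts

# Toy corpus, purely illustrative.
corpus = ["He fixed the car", "She fixed the dinner", "He drove his car"]
m, f = gender_cooccurrence(corpus)
print(m["car"], f["car"])  # 2 0: "car" co-occurs only with male pronouns here
```

In the paper, such co-occurrence statistics over PILE are what determine whether a seed word is sufficiently gender-neutral to be used in benchmark sentences.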
Dataset Splits | No | The paper mentions an "evaluation set Deval" and uses different filtered versions of existing and newly created datasets for evaluation, but does not define conventional training, validation, and test splits.
Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU models, CPU types, or memory).
Software Dependencies | Yes | Our experiments are based on the OpenAI ChatGPT (gpt-3.5-turbo, version available as of September 2023) API.
Experiment Setup | Yes | We instruct ChatGPT to generate 5 sentences per (gendered pronoun, seed word) pair, using up to 5, 10, or 20 words (see Appendix F). ... Throughout our experiments, we report the value of the US score such that it allows for relative differences of less than 65% in the probability space. In other words, this implies that we consider a pair to be skewed if a model assigns 1.65× more probability mass to one sentence over the other. In the log space, this yields ε = log 1.65 ≈ 0.2175.
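The skewness threshold above can be sketched in a few lines: a pair counts as skewed when one sentence receives more than 1.65× the probability mass of its counterpart, i.e. when the absolute log-probability gap exceeds ε = log 1.65 ≈ 0.2175 (base-10, matching the reported value). The function name and the example log-probabilities below are assumptions for illustration, not the paper's code.

```python
import math

# Threshold from the paper: |Δ log10 p| > log10(1.65) ≈ 0.2175
# marks a sentence pair as gender-skewed.
EPSILON = math.log10(1.65)

def is_skewed(logp_a: float, logp_b: float, eps: float = EPSILON) -> bool:
    """Flag a minimal sentence pair (differing only in the gendered
    pronoun) as skewed if one sentence gets >1.65x the probability mass."""
    return abs(logp_a - logp_b) > eps

print(round(EPSILON, 4))          # 0.2175
# Hypothetical base-10 log-probabilities for a minimal pair:
print(is_skewed(-42.10, -42.05))  # False: gap of 0.05 is below the threshold
print(is_skewed(-41.50, -42.00))  # True: gap of 0.50 exceeds 0.2175
```

Under this definition, a model is fair on a pair when the two pronoun variants receive nearly equal probability; aggregating the unskewed fraction over all pairs yields the benchmark's fairness score.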