Are Models Biased on Text without Gender-related Language?
Authors: Catarina G Belém, Preethi Seshadri, Yasaman Razeghi, Sameer Singh
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (WinoBias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models. |
| Researcher Affiliation | Academia | Catarina Belem, Preethi Seshadri, Yasaman Razeghi, Sameer Singh Department of Computer Science University of California Irvine {cbelem,preethi,yrazeghi,sameer}@uci.edu |
| Pseudocode | No | The paper describes a pipeline for generating benchmarks but does not provide formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the full dataset and code at https://ucinlp.github.io/unstereo-eval. |
| Open Datasets | Yes | Word-gender correlations are determined empirically using word co-occurrence statistics from PILE, a high-quality and publicly available pretraining set used to train popular LMs (Gao et al., 2021). ... Additionally, we utilize UnStereoEval to repurpose two commonly used fairness benchmarks whose sentences are already gender-invariant: WinoBias (WB) and Winogender (WG) (Zhao et al., 2018; Rudinger et al., 2018). |
| Dataset Splits | No | The paper mentions "evaluation set Deval" and uses different filtered versions of existing and newly created datasets for evaluation, but does not explicitly define distinct training, validation, and test splits in the conventional sense for model training/evaluation. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU models, CPU types, or memory). |
| Software Dependencies | Yes | Our experiments are based on OpenAI ChatGPT (gpt-3.5-turbo, version available as of September 2023) API. |
| Experiment Setup | Yes | We instruct ChatGPT to generate 5 sentences per (gendered pronoun, seed word) pair, using up to 5, 10, or 20 words (see Appendix F). ... Throughout our experiments, we report the value of US score such that it allows for relative differences of less than 65% in the probability space. In other words, this implies that we consider a pair to be skewed if a model assigns 1.65× more probability mass to one sentence over the other. In the log space, this yields ε = log 1.65 ≈ 0.2175. |
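The skew threshold quoted above can be verified with a short calculation. This is a minimal sketch, not the authors' released code: it assumes base-10 logarithms (consistent with log 1.65 ≈ 0.2175) and uses a hypothetical `is_skewed` helper to show how the ε threshold would flag a sentence pair whose log-probability gap exceeds it.

```python
import math

# A pair is considered skewed if one sentence receives at least 1.65x
# the probability mass of the other, i.e. the gap in log10 probabilities
# exceeds epsilon = log10(1.65) ~= 0.2175.
EPSILON = math.log10(1.65)

def is_skewed(logp_a: float, logp_b: float, eps: float = EPSILON) -> bool:
    """Return True if the log10-probability gap between the two
    sentences of a pair exceeds the threshold eps."""
    return abs(logp_a - logp_b) > eps

print(round(EPSILON, 4))        # 0.2175
print(is_skewed(-12.0, -12.5))  # True  (gap 0.5 exceeds 0.2175)
print(is_skewed(-12.0, -12.1))  # False (gap 0.1 is within 0.2175)
```

Note that the same comparison could be done with natural logs by using `math.log(1.65)` as the threshold; the quoted value 0.2175 corresponds to base 10.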