WEFE: The Word Embeddings Fairness Evaluation Framework

Authors: Pablo Badilla, Felipe Bravo-Marquez, Jorge Pérez

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a case study in which we rank various publicly available pre-trained word embeddings using WEAT [Caliskan et al., 2017], RND [Garg et al., 2018], and RNSB [Sweeney and Najafian, 2019] as fairness metrics. Our results show that for the case of gender bias, fairness rankings produced by different metrics tend to be correlated with each other. This correlation is substantially weaker when we consider other bias dimensions such as ethnicity and religion. (A toy rank-correlation sketch of this kind of comparison appears below the table.)
Researcher Affiliation | Academia | ¹Department of Computer Science, Universidad de Chile; ²Millennium Institute for Foundational Research on Data, IMFD-Chile; {pbadilla, fbravo, jperez}@dcc.uchile.cl
Pseudocode | No | The paper describes the framework using formal definitions and prose but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We have released WEFE as an open source toolkit (https://wefe.readthedocs.io/en/latest/) along with tutorials to reproduce this and other previous studies. (A minimal usage sketch follows the table.)
Open Datasets | Yes | The following are the pre-trained embedding models that we consider: 1) conceptnet, 2) fasttext-wikipedia, 3) glove-twitter, 4) glove-wikipedia, 5) lexvec-commoncrawl, 6) word2vec-googlenews, and 7) word2vec-gender-hard-debiased (also trained on Google News) [Bolukbasi et al., 2016]. We take the attribute word sets pleasant, unpleasant, math and arts from [Caliskan et al., 2017]; the target sets ethnicity-surnames, male and female, and attribute words related to intelligence, appearance, sensitive and occupations were taken from [Garg et al., 2018]; the attribute word set religion was taken from [Manzini et al., 2019]; positive and negative sentiment attribute words were taken from the Bing Liu lexicon [Hu and Liu, 2004].
Dataset Splits | No | The paper evaluates pre-trained word embeddings with pre-defined query sets. It does not describe training, validation, or test splits, since it evaluates existing models rather than training new ones on custom data.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used to run the experiments.
Software Dependencies | No | The paper mentions releasing an open source toolkit but does not list specific software dependencies with version numbers required for reproducibility.
Experiment Setup | No | The paper describes the overall framework and how it is applied in the case study (e.g., which embedding models and query sets were used) but does not detail experimental setup parameters such as hyperparameters, learning rates, or batch sizes, as it evaluates pre-trained models rather than training new ones.
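
To make the "Open Source Code" row concrete, here is a minimal sketch of running a WEAT query with the released toolkit, following WEFE's documented API (https://wefe.readthedocs.io/en/latest/). The gensim model identifier and the abbreviated word lists are illustrative stand-ins, not the paper's exact sets, and names may differ across toolkit versions:

```python
# Minimal WEAT query with WEFE; word lists abbreviated for illustration.
# Assumes: pip install wefe gensim (API per the WEFE docs; may vary by version).
import gensim.downloader as api

from wefe.query import Query
from wefe.metrics import WEAT
from wefe.word_embedding_model import WordEmbeddingModel

# Load one of the pre-trained embeddings of the kind evaluated in the paper
# (gensim-data id; the 25-dimensional GloVe Twitter vectors keep this light).
model = WordEmbeddingModel(api.load("glove-twitter-25"), "glove-twitter")

# Target and attribute sets in the spirit of the male/female and
# pleasant/unpleasant sets cited in the paper (abbreviated here).
query = Query(
    target_sets=[
        ["he", "man", "male", "son"],
        ["she", "woman", "female", "daughter"],
    ],
    attribute_sets=[
        ["joy", "love", "peace"],
        ["agony", "terror", "awful"],
    ],
    target_sets_names=["Male terms", "Female terms"],
    attribute_sets_names=["Pleasant", "Unpleasant"],
)

result = WEAT().run_query(query, model)
print(result)  # dict including the query name and the WEAT score
```

The same Query object can be handed to the toolkit's other metric implementations (e.g., RND, RNSB), which reflects the framework's central design: one standardized query interface shared by all fairness metrics.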
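Similarly, for the ranking-correlation claim in the "Research Type" row: once each embedding has a bias score under two metrics, the comparison reduces to a rank correlation. A toy sketch using invented placeholder scores, not the paper's results:

```python
# Compare fairness rankings from two metrics with Spearman correlation.
# The scores below are made-up placeholders; real values would come from
# running WEAT/RND/RNSB queries over each embedding model.
from scipy.stats import spearmanr

embeddings = ["conceptnet", "fasttext-wikipedia", "glove-twitter",
              "glove-wikipedia", "lexvec-commoncrawl", "word2vec-googlenews"]

# Hypothetical aggregated bias scores per embedding (lower = less biased).
weat_scores = [0.12, 0.35, 0.41, 0.29, 0.33, 0.47]
rnsb_scores = [0.10, 0.30, 0.45, 0.25, 0.38, 0.50]

# spearmanr ranks the inputs internally, so raw scores can be passed directly.
rho, p_value = spearmanr(weat_scores, rnsb_scores)
print(f"Spearman correlation between rankings: {rho:.2f} (p={p_value:.3f})")
```

A rho near 1 would indicate that the two metrics order the embeddings similarly, which is the pattern the paper reports for gender bias but finds weaker for ethnicity and religion.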