Understanding the Origins of Bias in Word Embeddings
Authors: Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, Richard Zemel
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the accuracy of our technique with experimental results on both a simplified corpus of Wikipedia articles in broad use (Wikimedia, 2018), and on a corpus of New York Times articles from 1987–2007 (Sandhaus, 2008). |
| Researcher Affiliation | Academia | Marc-Etienne Brunet 1 2 Colleen Alkalay-Houlihan 1 Ashton Anderson 1 2 Richard Zemel 1 2 1Department of Computer Science, University of Toronto, Toronto, Canada 2Vector Institute for Artificial Intelligence, Toronto, Canada. Correspondence to: Marc-Etienne Brunet <mebrunet@cs.toronto.edu>. |
| Pseudocode | Yes | Algorithm 1 Approximating Differential Bias |
| Open Source Code | Yes | Most of the code used in the experimentation has been made available online. Code at https://github.com/mebrunet/understanding-bias |
| Open Datasets | Yes | This first setup consists of a corpus constructed from a Simple English Wikipedia dump (2017-11-03) (Wikimedia, 2018) using 75-dimensional word vectors. [...] The corpus is constructed from 20 years of New York Times (NYT) articles (Sandhaus, 2008), using 200-dimensional vectors. |
| Dataset Splits | No | The paper mentions 'baseline WEAT effect sizes' and 'perturbation sets', but it does not specify explicit train/validation/test splits by percentage or sample count, nor does it refer to standard predefined splits for reproducibility of data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models (e.g., NVIDIA, Tesla), CPU models (e.g., Intel Xeon, AMD Ryzen), or specific cloud computing instances used for running the experiments. It mentions '75-dimensional word vectors' and '200-dimensional vectors' but not the hardware used to compute them. |
| Software Dependencies | No | The paper mentions software systems like 'word2vec', 'GloVe', and 'LiSSA algorithm', but it does not specify the versions of any programming languages (e.g., Python), libraries (e.g., TensorFlow, PyTorch, scikit-learn), or specific solver software with version numbers needed to replicate the experiments. |
| Experiment Setup | Yes | This first setup consists of a corpus constructed from a Simple English Wikipedia dump (2017-11-03) (Wikimedia, 2018) using 75-dimensional word vectors. [...] The corpus is constructed from 20 years of New York Times (NYT) articles (Sandhaus, 2008), using 200-dimensional vectors. [...] The original authors of GloVe used x_max = 100 and found good performance with α = 0.75. We use a CBOW architecture with the same vocabulary, vector dimensions, and window size as our GloVe embeddings. |
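
The three sketches below expand on the technical details quoted in the Pseudocode, Dataset Splits, and Experiment Setup rows above.

The "baseline WEAT effect sizes" mentioned under Dataset Splits are the bias metric the paper perturbs. A minimal sketch of the standard WEAT effect size (Caliskan et al., 2017), assuming word vectors are supplied as a dict of NumPy arrays; this is not the authors' implementation:

```python
import numpy as np

def cos(u, v):
    # Cosine similarity between two word vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B, vecs):
    # s(w, A, B): mean similarity to attribute set A minus mean similarity to B.
    return (np.mean([cos(vecs[w], vecs[a]) for a in A])
            - np.mean([cos(vecs[w], vecs[b]) for b in B]))

def weat_effect_size(X, Y, A, B, vecs):
    # Standardized difference of associations between target word sets X and Y.
    s_X = [association(x, A, B, vecs) for x in X]
    s_Y = [association(y, A, B, vecs) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)
```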
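
Algorithm 1 ("Approximating Differential Bias") estimates, without retraining, how this metric would change if a small perturbation set of documents were removed from the training corpus. The brute-force quantity being approximated can be sketched as below, using the `weat_effect_size` function above and any embedding trainer passed in as a callable; the sign convention is an assumption, and this is not the paper's influence-function approximation itself:

```python
def differential_bias(corpus, perturbation_set, X, Y, A, B, train_embedding):
    # Brute-force definition: retrain without the perturbation set and compare.
    # `train_embedding` maps a list of documents to a {word: vector} dict
    # (e.g. a GloVe trainer); the paper's algorithm avoids this retraining.
    baseline = weat_effect_size(X, Y, A, B, train_embedding(corpus))
    reduced_corpus = [doc for doc in corpus if doc not in perturbation_set]
    perturbed = weat_effect_size(X, Y, A, B, train_embedding(reduced_corpus))
    # Assumed sign convention: positive means the removed documents
    # contributed bias (removing them lowers the effect size).
    return baseline - perturbed
```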
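
The hyperparameters quoted in the Experiment Setup row (x_max = 100, α = 0.75) configure GloVe's co-occurrence weighting function (Pennington et al., 2014). A minimal sketch of that standard weighting, not taken from the paper's code:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    # Down-weights rare co-occurrence counts; frequent pairs are capped at 1.
    return (x / x_max) ** alpha if x < x_max else 1.0

# Example: a pair co-occurring 50 times gets weight ~0.59; 200 times gets 1.0.
print(glove_weight(50), glove_weight(200))
```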