Making Existing Clusterings Fairer: Algorithms, Complexity Results and Insights
Authors: Ian Davidson, S. S. Ravi | Pages: 3733-3740
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on Twitter, Census and NYT data sets show that our methods can modify existing clusterings for data sets in excess of 100,000 instances within minutes on laptops and find as fair but higher quality clusterings than fair by design clustering algorithms. |
| Researcher Affiliation | Academia | Ian Davidson,1 S. S. Ravi2 1Computer Science Department, University of California, Davis 2Biocomplexity Institute & Initiative, University of Virginia and Computer Science Department, University at Albany SUNY |
| Pseudocode | No | No explicit pseudocode or algorithm block was found. The paper refers to a technical report (Davidson and Ravi 2019) for algorithm details. |
| Open Source Code | No | No explicit statement about releasing the source code for the described methodology or a link to a code repository was found. |
| Open Datasets | Yes | Here we first analyze the well studied Adult dataset (e.g., (Chierichetti et al. 2017; Backurs et al. 2019)) that consists of 48,842 individuals (males 66.8%, females 33.2%) from the UCI repository (Dheeru and Karra Taniskidou 2017). |
| Dataset Splits | No | The paper does not explicitly state specific training, validation, or test dataset splits (e.g., percentages, sample counts, or detailed splitting methodology). |
| Hardware Specification | Yes | The mean run time over 100 experiments on a single core of a Mac Book Pro laptop (i5 processor) for a randomly created subset of the data sets. |
| Software Dependencies | No | The paper mentions software tools like MATLAB and BOW toolkit, but does not provide specific version numbers for any key software components or libraries required for reproduction. |
| Experiment Setup | Yes | To make these first two clusters fairer we apply our method by placing bounds on the first and second cluster's protected status ratios to be 0.5 ± 0.05 with the remaining clusters' proportion of females to be their current values as reported in Table 2 ± 0.15. This is achieved by setting the Ui and Li bounds in Equations (2) and (3). For each data set we find the best k = 10 clustering using plain k-means and spectral clustering + k-means (both from 1000 random restarts). |
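
The experiment setup above describes per-cluster bounds on the protected status ratio: a tight band around 0.5 for the clusters being made fairer, and a looser band around the current ratio for the rest. The sketch below illustrates only that bookkeeping, under stated assumptions. Equations (2) and (3) are not reproduced on this page, so the bounds are expressed here as ratios rather than in the paper's exact form, and the function names (`protected_ratios`, `ratio_bounds`, `violations`) and the toy data are hypothetical. It does not implement the paper's clustering-modification algorithm.

```python
from collections import Counter


def protected_ratios(labels, protected):
    """Fraction of protected instances (e.g. females) in each cluster."""
    sizes = Counter(labels)
    hits = Counter(c for c, p in zip(labels, protected) if p)
    return {c: hits[c] / sizes[c] for c in sizes}


def ratio_bounds(ratios, fair_clusters, target=0.5, tight=0.05, loose=0.15):
    """Hypothetical (lower, upper) ratio bounds per cluster.

    Clusters in `fair_clusters` get target +/- tight; every other cluster
    gets its current ratio +/- loose. Bounds are clipped to [0, 1].
    """
    bounds = {}
    for c, r in ratios.items():
        centre, slack = (target, tight) if c in fair_clusters else (r, loose)
        bounds[c] = (max(0.0, centre - slack), min(1.0, centre + slack))
    return bounds


def violations(ratios, bounds):
    """Clusters whose current ratio falls outside their bounds."""
    return [c for c, r in ratios.items()
            if not bounds[c][0] <= r <= bounds[c][1]]


# Toy data (assumption): three clusters, binary protected attribute.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
protected = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1]

ratios = protected_ratios(labels, protected)
bounds = ratio_bounds(ratios, fair_clusters={0, 1})
violated = violations(ratios, bounds)
print(ratios)
print(violated)
```

In this toy example only the clusters listed in `fair_clusters` can violate their bounds before any modification, because every other cluster's band is centred on its own current ratio.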