Proportional Fairness in Non-Centroid Clustering

Authors: Ioannis Caragiannis, Evi Micha, Nisarg Shah

NeurIPS 2024

Each item below gives a reproducibility variable, its assessed result, and the LLM response.
Research Type: Experimental
"Our experiments on real data suggest that traditional clustering algorithms are highly unfair, whereas Greedy Capture is considerably fairer and incurs only a modest loss in common clustering objectives."
Researcher Affiliation: Academia
Ioannis Caragiannis, Aarhus University (iannis@cs.au.dk); Evi Micha, Harvard University (emicha@seas.harvard.edu); Nisarg Shah, University of Toronto (nisarg@cs.toronto.edu)
Pseudocode: Yes
ALGORITHM 1: Greedy Cohesive Clustering(A)
Input: Set of agents N, metric d, number of clusters k
Output: k-clustering C = (C1, ..., Ck)
  N' ← N;                      // Remaining set of agents
  j ← 1;                       // Current cluster number
  while N' ≠ ∅ do
      Cj ← A(N', d, ⌈n/k⌉);    // Find and remove the next cohesive cluster
      N' ← N' \ Cj;
      j ← j + 1;
  end
  Cj, Cj+1, ..., Ck ← ∅;
  return C = (C1, ..., Ck);
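As a rough illustration of the loop in Algorithm 1, the sketch below carves out one cohesive cluster of ⌈n/k⌉ remaining agents at a time until none are left, then pads to k clusters. The subroutine A is an assumption here: a simple Greedy-Capture-style rule under the Euclidean metric (return the ⌈n/k⌉ nearest remaining neighbours of whichever point captures that many agents at the smallest radius). It is a sketch of the algorithmic skeleton, not the authors' implementation.

```python
import numpy as np

def greedy_capture_step(points, remaining, size):
    """Assumed subroutine A: grow balls around every remaining point uniformly
    and return the first ball to capture `size` agents, i.e. the `size`
    nearest remaining neighbours of the point whose `size`-th nearest
    distance is smallest (Euclidean metric)."""
    idx = list(remaining)
    best_set, best_radius = None, np.inf
    for i in idx:
        dists = np.linalg.norm(points[idx] - points[i], axis=1)
        order = np.argsort(dists)[:size]          # `size` closest remaining agents
        radius = dists[order[-1]]                 # radius at which the ball fills up
        if radius < best_radius:
            best_radius = radius
            best_set = {idx[j] for j in order}
    return best_set

def greedy_cohesive_clustering(points, k):
    """Algorithm 1 skeleton: repeatedly remove a cohesive cluster of
    ceil(n/k) remaining agents; leftover clusters are empty."""
    n = len(points)
    size = int(np.ceil(n / k))
    remaining = set(range(n))
    clusters = []
    while remaining:
        c = greedy_capture_step(points, remaining, min(size, len(remaining)))
        clusters.append(c)
        remaining -= c
    clusters += [set() for _ in range(k - len(clusters))]  # Cj, ..., Ck ← ∅
    return clusters
```

Because each full cluster has ⌈n/k⌉ agents, the loop runs at most k times, matching the padding step in the pseudocode.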
Open Source Code: No
The paper mentions implementing k-means++ and k-medoids from the Scikit-learn project, but it does not provide an explicit statement of, or a link to, the authors' own source code for the methodology described in the paper.
Open Datasets: Yes
"We consider three different datasets from the UCI Machine Learning Repository [19], namely Census Income, Diabetes, and Iris. For the first two datasets, each data point corresponds to a human being, and it is reasonable to assume that each individual prefers to be clustered along with other similar individuals. We also consider the third dataset for an interesting comparison with the empirical work of Chen et al. [1], who compared the same algorithms but for centroid clustering."
[19] D. Dua and C. Graff. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.
Dataset Splits: No
The paper states, "For both the Census Income and Diabetes datasets, we sample 100 data points and plot the means and standard deviations over 40 runs." While it mentions sampling and multiple runs, it does not specify how the dataset was split into training, validation, and testing sets, or give the percentages/counts for such splits.
Hardware Specification: Yes
"We conducted our experiments on a server with 32 cores / 64 threads at 4.2 GHz and 128 GB of RAM."
Software Dependencies: No
The paper states, "We implement the standard k-means++ and k-medoids clustering algorithms from the Scikit-learn project." While it names Scikit-learn, it does not provide a version number for this or any other software dependency, which is required for reproducibility.
Experiment Setup: Yes
"We implement the standard k-means++ and k-medoids clustering algorithms from the Scikit-learn project, averaging the values for each measure over 20 runs, as their outcomes depend on random initializations. The computation of Greedy Capture neither uses randomization nor depends on the loss function with respect to which the core or FJR approximation is measured. Since calculating core and FJR approximations requires considerable time, for both the Census Income and Diabetes datasets, we sample 100 data points and plot the means and standard deviations over 40 runs. For the former, we conduct weighted sampling according to the fnlwgt feature."
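The evaluation loop described in the quote above can be sketched as follows. Everything here is an assumption for illustration: `simple_kmeans` is a minimal Lloyd's-iterations placeholder for scikit-learn's k-means++, the objective is the standard sum of squared distances, and `run_trials` mimics the pattern of drawing a weighted sample (cf. the fnlwgt weighting) per trial and reporting mean and standard deviation over trials.

```python
import numpy as np

def simple_kmeans(points, k, rng, n_iter=20):
    """Minimal Lloyd's iterations with random init; a stand-in for the
    scikit-learn k-means++ the authors used, not their implementation."""
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)                  # assign to nearest center
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

def kmeans_objective(points, labels, centers):
    """Sum of squared distances to assigned centers (the k-means loss)."""
    return float(((points - centers[labels]) ** 2).sum())

def run_trials(data, weights, k=5, n_trials=40, sample_size=100, seed=0):
    """Per trial: draw a weighted sample of `sample_size` rows, cluster it,
    record the objective; return mean and std over `n_trials` trials."""
    rng = np.random.default_rng(seed)
    p = weights / weights.sum()                    # sampling probabilities
    scores = []
    for _ in range(n_trials):
        idx = rng.choice(len(data), size=sample_size, replace=False, p=p)
        sample = data[idx]
        labels, centers = simple_kmeans(sample, k, rng)
        scores.append(kmeans_objective(sample, labels, centers))
    return np.mean(scores), np.std(scores)
```

A randomized initializer is why the paper averages over multiple runs; a deterministic method such as Greedy Capture would need only one.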