Generalized Reductions: Making any Hierarchical Clustering Fair and Balanced with Low Cost

Authors: Marina Knittel, Max Springer, John P Dickerson, Mohammadtaghi Hajiaghayi

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our results using Dasgupta's cost function, perhaps one of the most prevalent theoretical metrics for hierarchical clustering evaluation. Our work vastly improves the previous O(n^{5/6} polylog(n)) fair approximation for cost to a near-polylogarithmic O(n^δ polylog(n)) fair approximation for any constant δ ∈ (0, 1). This result establishes a cost-fairness tradeoff and extends to broader fairness constraints than the previous work. We also show how to alter existing hierarchical clusterings to guarantee fairness and cluster balance across any level in the hierarchy. ... Section 5. Experiments. This section validates our algorithms from Section 4. Our simulations demonstrate that our algorithm incurs only a modest loss in the hierarchical clustering objective and exhibits increased fairness. Specifically, the approximate cost increases as a function of Algorithm 3's defining parameters: c, δ, and k. ... We additionally illustrate the resulting balance of our hierarchical clustering algorithm by presenting the distribution of the cluster ratios of the projected features (blue to red points) in Figure 5 for the Census data.
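For readers unfamiliar with the objective quoted above, the following is a minimal sketch of Dasgupta's cost function for a binary hierarchical clustering: each pair of points contributes its similarity weight times the number of leaves under the pair's lowest common ancestor. The tree encoding and names here are illustrative, not the paper's implementation.

```python
# A tree node is either a leaf (an int point id) or a tuple (left, right).

def leaves(node):
    """Return the set of leaf ids under a node."""
    if isinstance(node, int):
        return {node}
    left, right = node
    return leaves(left) | leaves(right)

def dasgupta_cost(node, w):
    """Dasgupta's cost: every pair split at this node pays
    |leaves(node)| * w(i, j); recurse into both children."""
    if isinstance(node, int):
        return 0.0
    left, right = node
    n_leaves = len(leaves(node))
    cross = sum(w[min(i, j), max(i, j)]
                for i in leaves(left) for j in leaves(right))
    return n_leaves * cross + dasgupta_cost(left, w) + dasgupta_cost(right, w)

# Toy example: 4 points with high within-cluster similarity,
# clustered as ((0, 1), (2, 3)).
w = {(0, 1): 1.0, (2, 3): 1.0, (0, 2): 0.1, (0, 3): 0.1,
     (1, 2): 0.1, (1, 3): 0.1}
tree = ((0, 1), (2, 3))
print(dasgupta_cost(tree, w))
```

A good tree separates dissimilar points near the root, so the large leaf-count factor multiplies only small weights; this is why low cost corresponds to a high-quality hierarchy.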
Researcher Affiliation | Academia | Department of Computer Science, University of Maryland, College Park, USA. Correspondence to: Marina Knittel <mknittel@umd.edu>, Max Springer <mss423@umd.edu>, John Dickerson <john@cs.umd.edu>, Mohammad Hajiaghayi <hajiagha@cs.umd.edu>.
Pseudocode | Yes | Algorithm 1: Rebalance Tree; Algorithm 2: Refine Rebalance Tree; Algorithm 3: Fair HC; Algorithm 4: Subtree Search
Open Source Code | Yes | Implementation. The Python code for the following experiments is available in the Supplementary Material.
Open Datasets | Yes | Datasets. We use two data sets, Census and Bank, from the UCI data repository (Dua & Graff, 2017).
Dataset Splits | No | The paper does not provide specific training, validation, or test dataset splits with percentages, counts, or references to predefined splits.
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper mentions 'Python code' but does not specify any software names with version numbers for libraries or frameworks used in the experiments.
Experiment Setup | Yes | We vary the parameters c ∈ {2^i} for i = 0, ..., 5, δ (in increments of 1/8), and k ∈ {2^i} for i = 1, ..., 4 to experimentally validate their theoretical impact on the approximate guarantees of Section 4. ... Parameters were set to c = 4, δ = 3/8, k = 4 for the above clustering result.
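The parameter sweep described above can be sketched as a simple grid enumeration. The ranges for c and k follow the quoted setup; the exact set of δ values is an assumption here (multiples of 1/8 in (0, 1)), since the report only gives the increment.

```python
from itertools import product

cs = [2**i for i in range(6)]          # c in {1, 2, 4, 8, 16, 32}
deltas = [i / 8 for i in range(1, 8)]  # δ in increments of 1/8 (assumed range (0, 1))
ks = [2**i for i in range(1, 5)]       # k in {2, 4, 8, 16}

grid = list(product(cs, deltas, ks))
print(len(grid))  # 6 * 7 * 4 = 168 configurations

# The highlighted setting from the report is one point in this grid.
assert (4, 3/8, 4) in grid
```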