Generalized Reductions: Making any Hierarchical Clustering Fair and Balanced with Low Cost
Authors: Marina Knittel, Max Springer, John P Dickerson, Mohammadtaghi Hajiaghayi
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our results using Dasgupta's cost function, perhaps one of the most prevalent theoretical metrics for hierarchical clustering evaluation. Our work vastly improves the previous O(n^{5/6} polylog(n)) fair approximation for cost to a near polylogarithmic O(n^δ polylog(n)) fair approximation for any constant δ ∈ (0, 1). This result establishes a cost-fairness tradeoff and extends to broader fairness constraints than the previous work. We also show how to alter existing hierarchical clusterings to guarantee fairness and cluster balance across any level in the hierarchy. ... Section 5. Experiments. This section validates our algorithms from Section 4. Our simulations demonstrate that our algorithm incurs only a modest loss in the hierarchical clustering objective and exhibits increased fairness. Specifically, the approximate cost increases as a function of Algorithm 3's defining parameters: c, δ, and k. ... We additionally illustrate the resulting balance of our hierarchical clustering algorithm by presenting the distribution of the cluster ratios of the projected features (blue to red points) in Figure 5 for the Census data. (Code sketches of the cost objective and the balance ratio appear after the table.) |
| Researcher Affiliation | Academia | Department of Computer Science, University of Maryland, College Park, USA. Correspondence to: Marina Knittel <mknittel@umd.edu>, Max Springer <mss423@umd.edu>, John Dickerson <john@cs.umd.edu>, Mohammad Hajiaghayi <hajiagha@cs.umd.edu>. |
| Pseudocode | Yes | Algorithm 1 Rebalance Tree; Algorithm 2 Refine Rebalance Tree; Algorithm 3 Fair HC; Algorithm 4 Subtree Search |
| Open Source Code | Yes | Implementation. The Python code for the following experiments is available in the Supplementary Material. |
| Open Datasets | Yes | Datasets. We use two data sets, Census and Bank, from the UCI data repository (Dua & Graff, 2017). |
| Dataset Splits | No | The paper does not provide specific training, validation, or test dataset splits with percentages, counts, or references to predefined splits. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Python code' but does not specify any software names with version numbers for libraries or frameworks used in the experiments. |
| Experiment Setup | Yes | We vary the parameters c ∈ {2^i}_{i=0}^{5}, δ (in increments of 1/8), and k ∈ {2^i}_{i=1}^{4} to experimentally validate their theoretical impact on the approximate guarantees of Section 4. ... Parameters were set to c = 4, δ = 3/8, k = 4 for the above clustering result. (A sketch of this parameter grid appears after the table.) |
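
The Research Type row quotes the paper's use of Dasgupta's cost function. The following is a minimal Python sketch of that objective for intuition only; it is not the authors' implementation, and the nested-tuple tree representation and the `sim` similarity matrix are assumptions made here.

```python
# Minimal sketch of Dasgupta's cost objective (not the authors' code).
# Assumptions: a binary hierarchy given as nested tuples of leaf indices,
# e.g. ((0, 1), (2, 3)), and a symmetric similarity matrix `sim`.
import itertools
import numpy as np

def leaves(tree):
    """Collect the leaf indices under `tree`."""
    if isinstance(tree, int):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def dasgupta_cost(tree, sim):
    """cost(T) = sum_{i<j} sim[i, j] * |leaves(lca_T(i, j))|.

    Computed equivalently as, for each internal node, (number of leaves
    below it) times (total similarity crossing its left/right split).
    """
    if isinstance(tree, int):
        return 0.0
    left, right = tree
    l_leaves, r_leaves = leaves(left), leaves(right)
    crossing = sum(sim[i, j] for i, j in itertools.product(l_leaves, r_leaves))
    size = len(l_leaves) + len(r_leaves)
    return size * crossing + dasgupta_cost(left, sim) + dasgupta_cost(right, sim)

# Toy check: 4 points with two highly similar pairs.
sim = np.array([[0.0, 1.0, 0.1, 0.1],
                [1.0, 0.0, 0.1, 0.1],
                [0.1, 0.1, 0.0, 1.0],
                [0.1, 0.1, 1.0, 0.0]])
print(dasgupta_cost(((0, 1), (2, 3)), sim))  # 5.6 (good split)
print(dasgupta_cost(((0, 2), (1, 3)), sim))  # 9.2 (bad split)
```

The toy check illustrates the objective's intent: the hierarchy that keeps the two highly similar pairs together attains the lower cost.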
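The same row also mentions the distribution of blue-to-red cluster ratios plotted in Figure 5 for the Census data. Below is a sketch of how such per-cluster balance ratios could be computed, again under assumed inputs rather than the paper's actual pipeline.

```python
# Minimal sketch (not the paper's code): the blue-to-red ratio of each
# cluster at one level of a hierarchy, the statistic whose distribution
# Figure 5 reports. `clusters` and `colors` are assumed inputs
# (cluster membership lists and a protected color per point).
from collections import Counter

def blue_red_ratios(clusters, colors):
    """Return the blue:red count ratio for each cluster.

    clusters: iterable of lists of point indices (one hierarchy level)
    colors:   mapping from point index to 'blue' or 'red'
    """
    ratios = []
    for cluster in clusters:
        counts = Counter(colors[i] for i in cluster)
        blue, red = counts.get('blue', 0), counts.get('red', 0)
        ratios.append(blue / red if red else float('inf'))
    return ratios

# Toy usage: two balanced clusters and one all-blue cluster.
colors = {0: 'blue', 1: 'red', 2: 'blue', 3: 'red', 4: 'blue', 5: 'blue'}
print(blue_red_ratios([[0, 1], [2, 3], [4, 5]], colors))  # [1.0, 1.0, inf]
```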
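Finally, the Experiment Setup row describes a grid over c, δ, and k. A hypothetical harness for organizing that sweep is sketched below; `fair_hc` and `cost_fn` are placeholder names standing in for the paper's Algorithm 3 (Fair HC) and its cost evaluation, and the exact δ grid is an assumption.

```python
# Hypothetical harness for the (c, delta, k) sweep in the Experiment
# Setup row. `fair_hc` and `cost_fn` are assumed callables, not the
# paper's API; the delta grid (multiples of 1/8) is also an assumption.
from itertools import product

C_VALUES = [2 ** i for i in range(0, 6)]       # 1, 2, 4, 8, 16, 32
DELTA_VALUES = [i / 8 for i in range(1, 8)]    # 1/8, 2/8, ..., 7/8
K_VALUES = [2 ** i for i in range(1, 5)]       # 2, 4, 8, 16

def parameter_sweep(points, colors, fair_hc, cost_fn):
    """Run the fair hierarchical clustering routine over the full grid
    and record the resulting cost for each (c, delta, k) setting."""
    results = {}
    for c, delta, k in product(C_VALUES, DELTA_VALUES, K_VALUES):
        tree = fair_hc(points, colors, c=c, delta=delta, k=k)
        results[(c, delta, k)] = cost_fn(tree, points)
    return results
```

Recording the cost per grid point in this way is what would let one reproduce the reported trend that the approximate cost grows with c, δ, and k.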