Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Unifying Proportional Fairness in Centroid and Non-Centroid Clustering
Authors: Benjamin Cookson, Nisarg Shah, Ziqi Yu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main result is a novel algorithm which achieves a constant approximation to the core, in polynomial time, even when the distance metrics used for centroid and non-centroid loss measurements are different. We also derive improved results for more restricted loss functions and the weaker FJR criterion, and establish lower bounds in each case. Finally, we also evaluate the performance of our algorithms on real-world datasets in Appendix F. |
| Researcher Affiliation | Academia | Benjamin Cookson, Department of Computer Science, University of Toronto, EMAIL; Nisarg Shah, Department of Computer Science, University of Toronto, EMAIL; Ziqi Yu, Department of Computer Science, University of Toronto, EMAIL |
| Pseudocode | Yes | Algorithm 1: Dual Metric Algorithm; Algorithm 4: 4-Approximate Most Cohesive Clustering Algorithm; Algorithm 5: Non-Centroid Greedy Capture With Greedy Centroid Selection; Algorithm 6: Semi-Ball-Growing Algorithm; Algorithm 7: Iterative α-MCC Clustering; Algorithm 2: Centroid Greedy Capture [1]; Algorithm 3: Non-Centroid Greedy Capture [2] |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide our code as part of the supplementary material, and the datasets we use are openly available. |
| Open Datasets | Yes | We use three datasets from the UCI Machine Learning Repository [29]: Iris, Pima Indians Diabetes, and Adult. These are the three datasets used by Caragiannis et al. [2] for their experiments with non-centroid clustering. [29] Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. The UCI machine learning repository. https://archive.ics.uci.edu, 2025. Accessed: 2025-05-15. |
| Dataset Splits | No | The paper mentions random sampling for trials (e.g., "randomly sample 100 data points in each of 40 independent trials"), but it does not specify explicit training/test/validation dataset splits with percentages, sample counts, or references to predefined splits for model training or evaluation in the conventional sense. The experiments primarily involve evaluating clustering algorithms on datasets. |
| Hardware Specification | No | Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Yes, our experiments do not require a large amount of compute and can be executed on a personal computer. In the section where we outline our experiments, we mention the steps that were computational bottlenecks, but even these were not a large undertaking to run on a laptop. This justification is too vague and does not provide specific hardware details like GPU/CPU models, processor types, or memory amounts. |
| Software Dependencies | No | The k-means++ and k-medoids implementations are based on the Scikit-learn package in Python. This mentions software names (Scikit-learn, Python) but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | Experimental setup. Following Chen et al. [1], Caragiannis et al. [2], we assume N = M, and use the Euclidean L2 distance metric. We vary two parameters: the number of clusters k ∈ {5, 6, ..., 25} and the weighted loss parameter λ ∈ {0.1, 0.2, ..., 0.9}. Specifically, we report experimental results when varying λ ∈ {0.1, 0.2, ..., 0.9} while fixing k = 15, and when varying k ∈ {5, 6, ..., 25} while fixing λ = 0.5. |
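
The parameter sweeps described under Experiment Setup can be enumerated with a short script. The following is a minimal sketch, not the authors' code: the names `experiment_configs` and `pairwise_l2` are hypothetical, and the only facts taken from the table are the ranges of k and λ, the fixed values k = 15 and λ = 0.5, the Euclidean metric, and the assumption N = M (every data point is also a candidate center). The clustering algorithms themselves are not reproduced here.

```python
import math

# Ranges quoted in the Experiment Setup row.
K_VALUES = list(range(5, 26))                 # k in {5, 6, ..., 25}
LAMBDA_VALUES = [i / 10 for i in range(1, 10)]  # lambda in {0.1, ..., 0.9}
FIXED_K = 15
FIXED_LAMBDA = 0.5


def experiment_configs():
    """Return the two sweeps as lists of (k, lam) pairs.

    The first list varies lambda with k fixed; the second varies k
    with lambda fixed, mirroring the setup quoted in the table.
    """
    lambda_sweep = [(FIXED_K, lam) for lam in LAMBDA_VALUES]
    k_sweep = [(k, FIXED_LAMBDA) for k in K_VALUES]
    return lambda_sweep, k_sweep


def pairwise_l2(points):
    """Euclidean distance matrix over the points.

    With N = M, the same matrix gives agent-to-agent and
    agent-to-candidate-center distances.
    """
    return [[math.dist(p, q) for q in points] for p in points]


if __name__ == "__main__":
    lambda_sweep, k_sweep = experiment_configs()
    print(len(lambda_sweep), "lambda settings at k =", FIXED_K)
    print(len(k_sweep), "k settings at lambda =", FIXED_LAMBDA)
```

Each (k, λ) pair would then be passed to whichever clustering routine is under evaluation, together with the distance matrix for the sampled data points.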