Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Differentially Private Clustering in High-Dimensional Euclidean Spaces
Authors: Maria-Florina Balcan, Travis Dick, Yingyu Liang, Wenlong Mou, Hongyang Zhang
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on both synthetic and real datasets verify the effectiveness of the proposed method. |
| Researcher Affiliation | Academia | 1Carnegie Mellon University, Pittsburgh, PA, USA 2Princeton University, Princeton, NJ, USA 3Peking University, Beijing, China. |
| Pseudocode | Yes | Algorithm 1: private-partition({x_i}_{i=1}^n, ϵ, δ, Q). Algorithm 2: candidate({x_i}_{i=1}^n, ϵ, δ). Algorithm 3: localswap({x_i}_{i=1}^n, C, ϵ, δ). Algorithm 4: Private Clustering. Algorithm 5: Privately Recover Centers for Sparse Dataset. |
| Open Source Code | No | The paper does not provide concrete access to source code for the described methodology. |
| Open Datasets | Yes | MNIST: the raw pixels of MNIST (LeCun et al., 1998). CIFAR-10: 100k randomly sampled examples from the CIFAR-10 dataset (Krizhevsky, 2009). Synthetic: a synthetic dataset of 100k samples drawn from a mixture of 64 Gaussians in R^100. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, or test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software such as 'k-means++', 'SuLQ k-means', and the 'Gaussian mechanism', but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The implementation of our algorithm projects to a space of dimension p = log(n)/2, rather than 8 log(n), and repeats the candidate set construction routine only k times. Finally, we perform 8 iterations of the SuLQ k-means algorithm to further improve the quality of the resulting centers. These modifications do not affect the privacy guarantee of the algorithm, but gave improved empirical performance. Our implementation of the SuLQ k-means algorithm runs for 20 iterations and uses the Gaussian mechanism to approximate the sum of points in each cluster, since this allowed us to add less noise. The SuLQ algorithm initializes its centers to be k randomly chosen points from the bounding box of the data. Unless otherwise stated, we set ϵ = 1.0. |
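The Experiment Setup row describes a SuLQ-style private k-means loop: centers initialized from the data's bounding box, then repeated Lloyd iterations where per-cluster counts and sums are perturbed via the Gaussian mechanism. The sketch below illustrates that structure only; the function name, the single `noise_scale` parameter standing in for the noise calibrated to the (ϵ, δ) budget, and the center-update guard are all assumptions, not the paper's implementation.

```python
import numpy as np

def sulq_kmeans(X, k, iters=20, noise_scale=1.0, rng=None):
    """Sketch of a SuLQ-style private k-means loop.

    Each iteration assigns points to their nearest center, then perturbs
    the per-cluster counts and coordinate-wise sums with Gaussian noise
    (the Gaussian mechanism) before recomputing centers.  `noise_scale`
    is a placeholder for the scale implied by the (eps, delta) budget;
    calibrating it correctly is out of scope for this sketch.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Initialize centers uniformly at random from the bounding box of the data.
    centers = rng.uniform(lo, hi, size=(k, d))
    for _ in range(iters):
        # Assign each point to its nearest current center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            # Noisy count and noisy sum of the cluster's points (Gaussian mechanism).
            count = mask.sum() + rng.normal(0.0, noise_scale)
            total = X[mask].sum(axis=0) + rng.normal(0.0, noise_scale, size=d)
            # Only update when the noisy count is usable (hypothetical guard).
            if count >= 1.0:
                centers[j] = total / count
    return centers
```

With small noise this behaves like ordinary Lloyd iterations; larger `noise_scale` trades cluster quality for a tighter privacy budget, which is why the paper reports using the Gaussian mechanism specifically to add less noise.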