$k$-Means Clustering with Distance-Based Privacy

Authors: Alessandro Epasto, Vahab Mirrokni, Shyam Narayanan, Peilin Zhong

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In addition, we empirically demonstrate that our algorithm performs significantly better than previous differentially private clustering algorithms, as well as naive distance-based private clustering baselines."
Researcher Affiliation | Collaboration | Alessandro Epasto (Google Research, aepasto@google.com), Vahab Mirrokni (Google Research, mirrokni@google.com), Shyam Narayanan (MIT, shyamsn@mit.edu), Peilin Zhong (Google Research, peilinz@google.com)
Pseudocode | Yes | "Algorithm 1 Main Algorithm: dist-DP k-means (resp., k-median) clustering"
Open Source Code | No | The paper states, 'We use the DP coreset implementation provided by the DP baseline for the purpose of the computation of semi-coreset Xj described in Section 5,' referencing an external library (footnote 3: https://ai.googleblog.com/2021/10/practical-differentially-private.html). However, it does not provide an explicit statement or link to open-source code for its own method.
Open Datasets | Yes | "We evaluate our algorithm on 6 well-known public datasets: brightkite (51406 × 2), gowalla (107092 × 2), shuttle (58000 × 10), skin [12] (245057 × 4), rangequeries [67] (200000 × 6), and s-sets [42] (5000 × 2). brightkite and gowalla are datasets of geographic locations (latitude and longitude) of users and can be found in the Stanford Large Network Dataset Collection (SNAP) [54]; shuttle, skin, and rangequeries are non-geographic datasets available from the UCI Repository [29]; s-sets is another non-geographic dataset available from the clustering benchmark datasets."
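For reference, the objective these datasets are evaluated under, the k-means cost, takes only a few lines of NumPy. This is a minimal sketch: the synthetic 2-d points and the choice of k = 8 below are hypothetical stand-ins for a geographic dataset such as brightkite, not the paper's actual setup.

```python
import numpy as np

def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center
    (the standard k-means objective)."""
    # Pairwise squared distances, shape (n, k).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Each point pays the distance to its closest center.
    return d2.min(axis=1).sum()

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 2))   # stand-in for a 2-d dataset like brightkite
centers = rng.normal(size=(8, 2))     # k = 8 hypothetical centers
cost = kmeans_cost(points, centers)
```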
Dataset Splits | No | The paper details the public datasets used and their preprocessing steps, but it does not specify train/validation/test splits, percentages, or any predefined partitioning methodology needed to reproduce the experiment.
Hardware Specification | No | The paper states, 'For running time, though we did not optimize our implementation, each algorithm runs within at most a few minutes in a single thread mode.' However, it does not provide specific hardware details such as CPU/GPU models, processor types, or memory specifications.
Software Dependencies | No | The paper mentions using the 'Python scikit-learn package' and a 'standard open-source DP library' (footnotes 3 and 6). However, it does not provide version numbers for these software components, which reproducibility requires.
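Pinning exact versions, e.g. in a requirements file, is precisely what this criterion asks for. The versions below are purely illustrative and are not taken from the paper:

```text
# requirements.txt (illustrative versions only, not from the paper)
scikit-learn==1.3.2
numpy==1.26.4
```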
Experiment Setup | Yes | "In all experiments, we fix privacy parameters ε = 1, δ = 10⁻⁶. ... We choose REP = 5. The random shifted vector is chosen uniformly at random from [0, 4]^d, and thus the cell at the highest level of the quadtree has side length 8. We choose L1 = 5 and L2 = 10. ... We set S = √(2 log(1.25/(δ/6))) · √d / (ε/6)."