$k$-Means Clustering with Distance-Based Privacy
Authors: Alessandro Epasto, Vahab Mirrokni, Shyam Narayanan, Peilin Zhong
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In addition, we empirically demonstrate that our algorithm performs significantly better than previous differentially private clustering algorithms, as well as naive distance-based private clustering baselines. |
| Researcher Affiliation | Collaboration | Alessandro Epasto, Google Research, aepasto@google.com; Vahab Mirrokni, Google Research, mirrokni@google.com; Shyam Narayanan, MIT, shyamsn@mit.edu; Peilin Zhong, Google Research, peilinz@google.com |
| Pseudocode | Yes | Algorithm 1 Main Algorithm: dist-DP k-means (resp., k-median) clustering |
| Open Source Code | No | The paper states, 'We use the DP coreset implementation provided by the DP baseline for the purpose of the computation of semi-coreset Xj described in Section 5,' referencing an external library (footnote 3: 'https://ai.googleblog.com/2021/10/practical-differentially-private.html'). However, it does not provide an explicit statement or link for the open-source code of its own methodology. |
| Open Datasets | Yes | We evaluate our algorithm on 6 well-known public datasets: brightkite (51406 × 2), gowalla (107092 × 2), shuttle (58000 × 10), skin [12] (245057 × 4), rangequeries [67] (200000 × 6), and s-sets [42] (5000 × 2). brightkite and gowalla are datasets of geographic locations (latitude and longitude) of users and can be found in the Stanford Large Network Dataset Collection (SNAP) [54]; shuttle, skin, and rangequeries are non-geographic datasets and can be found in the UCI Repository [29]; and s-sets is another non-geographic dataset and can be found in the clustering benchmark dataset (footnote 2). |
| Dataset Splits | No | The paper details the public datasets used and their preprocessing steps, but it does not specify any particular train/validation/test splits, percentages, or predefined methodologies for partitioning the data for reproducibility of the experiment. |
| Hardware Specification | No | The paper states, 'For running time, though we did not optimize our implementation, each algorithm runs within at most a few minutes in a single thread mode.' However, it does not provide any specific hardware details such as GPU/CPU models, processor types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'Python scikit-learn package' and a 'standard open-source DP library' (referenced in footnotes 3 and 6). However, it does not provide specific version numbers for these software components, which are required for reproducibility. |
| Experiment Setup | Yes | In all experiments, we fix privacy parameters ε = 1, δ = 10⁻⁶. ... We choose REP = 5. The random shifted vector is chosen uniformly at random from [0, 4]^d, and thus the cell in the highest level of the quadtree has side length 8. We choose L1 = 5 and L2 = 10. ... We set S = √(2 log(1.25/(δ/6))) · √d / (ε/6). |
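The noise scale quoted in the experiment setup matches the standard Gaussian-mechanism calibration for a d-dimensional query with the (ε, δ) budget split into six parts. The sketch below is an illustration of that formula, not the authors' code; the function name and the `splits` parameter are assumptions for exposition.

```python
import math

def gaussian_noise_scale(eps: float, delta: float, d: int, splits: int = 6) -> float:
    """Gaussian-mechanism noise scale S = sqrt(2 log(1.25/(delta/splits))) * sqrt(d) / (eps/splits).

    Hypothetical helper illustrating the paper's stated setting, where the
    privacy budget is divided into `splits` = 6 pieces (eps/6, delta/6 each).
    """
    eps_i = eps / splits
    delta_i = delta / splits
    return math.sqrt(2 * math.log(1.25 / delta_i)) * math.sqrt(d) / eps_i

# With the paper's reported parameters (eps = 1, delta = 1e-6) on a
# 2-dimensional geographic dataset such as brightkite:
S = gaussian_noise_scale(1.0, 1e-6, d=2)
```

Note the √d factor: the noise is calibrated for the ℓ₂ sensitivity of a d-dimensional point sum, so higher-dimensional datasets (e.g. shuttle with d = 10) incur proportionally larger noise.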