$k$-Means Clustering with Distance-Based Privacy
Authors: Alessandro Epasto, Vahab Mirrokni, Shyam Narayanan, Peilin Zhong
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In addition, we empirically demonstrate that our algorithm performs significantly better than previous differentially private clustering algorithms, as well as naive distance-based private clustering baselines. |
| Researcher Affiliation | Collaboration | Alessandro Epasto, Google Research, aepasto@google.com; Vahab Mirrokni, Google Research, mirrokni@google.com; Shyam Narayanan, MIT, shyamsn@mit.edu; Peilin Zhong, Google Research, peilinz@google.com |
| Pseudocode | Yes | Algorithm 1 Main Algorithm: dist-DP k-means (resp., k-median) clustering |
| Open Source Code | No | The paper states, 'We use the DP coreset implementation provided by the DP baseline for the purpose of the computation of semi-coreset Xj described in Section 5,' referencing an external library (footnote 3: 'https://ai.googleblog.com/2021/10/practical-differentially-private.html'). However, it does not provide an explicit statement or link for the open-source code of its own methodology. |
| Open Datasets | Yes | We evaluate our algorithm on 6 well-known public datasets: brightkite (51406 × 2), gowalla (107092 × 2), shuttle (58000 × 10), skin [12] (245057 × 4), rangequeries [67] (200000 × 6), and s-sets [42] (5000 × 2). brightkite and gowalla are datasets of geographic locations (latitude and longitude) of users and can be found in the Stanford Large Network Dataset Collection (SNAP) [54]; shuttle, skin, and rangequeries are non-geographic datasets and can be found in the UCI Repository [29]; and s-sets is another non-geographic dataset and can be found in the clustering benchmark dataset (footnote 2). |
| Dataset Splits | No | The paper details the public datasets used and their preprocessing steps, but it does not specify any particular train/validation/test splits, percentages, or predefined methodologies for partitioning the data for reproducibility of the experiment. |
| Hardware Specification | No | The paper states, 'For running time, though we did not optimize our implementation, each algorithm runs within at most a few minutes in a single thread mode.' However, it does not provide any specific hardware details such as GPU/CPU models, processor types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'Python scikit-learn package' and a 'standard open-source DP library' (referenced in footnotes 3 and 6). However, it does not provide specific version numbers for these software components, which are required for reproducibility. |
| Experiment Setup | Yes | In all experiments, we fix privacy parameters ε = 1, δ = 10⁻⁶. ... We choose REP = 5. The random shifted vector is chosen uniformly at random from [0, 4]^d, and thus the cell in the highest level of the quadtree has side length 8. We choose L1 = 5 and L2 = 10. ... We set S = √(2 log(1.25/(δ/6))) · √d / (ε/6). |
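The noise scale quoted in the experiment setup matches the standard Gaussian-mechanism calibration for a d-dimensional query with the (ε, δ) budget split into six parts. The sketch below is an illustration of that formula, not the authors' code; the function name and the `splits` parameter are assumptions for exposition.

```python
import math

def gaussian_noise_scale(eps: float, delta: float, d: int, splits: int = 6) -> float:
    """Gaussian-mechanism noise scale S = sqrt(2 log(1.25/(delta/splits))) * sqrt(d) / (eps/splits).

    Hypothetical helper illustrating the paper's stated setting, where the
    privacy budget is divided into `splits` = 6 pieces (eps/6, delta/6 each).
    """
    eps_i = eps / splits
    delta_i = delta / splits
    return math.sqrt(2 * math.log(1.25 / delta_i)) * math.sqrt(d) / eps_i

# With the paper's reported parameters (eps = 1, delta = 1e-6) on a
# 2-dimensional geographic dataset such as brightkite:
S = gaussian_noise_scale(1.0, 1e-6, d=2)
```

Note the √d factor: the noise is calibrated for the ℓ₂ sensitivity of a d-dimensional point sum, so higher-dimensional datasets (e.g. shuttle with d = 10) incur proportionally larger noise.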