Differentially Private Set Union
Authors: Sivakanth Gopi, Pankaj Gulhane, Janardhan Kulkarni, Judy Hanwen Shen, Milad Shokouhi, Sergey Yekhanin
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using a Reddit dataset, we demonstrate that our algorithms significantly improve the size of DP set union even when compared to natural generalizations of the existing mechanisms for this problem (see Figure 1). Our algorithms are being productized in industry to make a basic subroutine in an NLP application differentially private. From Section 5 (Experiments): While the algorithms we described generalize to many domains that involve the release of set union, our experiments will use a natural language dataset. |
| Researcher Affiliation | Industry | Sivakanth Gopi¹, Pankaj Gulhane¹, Janardhan Kulkarni¹, Judy Hanwen Shen¹,², Milad Shokouhi¹, Sergey Yekhanin¹. ¹Microsoft. ²Work done as part of the Microsoft AI Residency Program. Correspondence to: Sivakanth Gopi <sigopi@microsoft.com>, Janardhan Kulkarni <jakul@microsoft.com>, Judy Hanwen Shen <hashe@microsoft.com>. |
| Pseudocode | Yes | Algorithm 1: High-level meta algorithm for DP Set Union; Algorithm 2: High-level meta algorithm for building a weighted histogram using a given update policy; Algorithm 3: ℓ1-DESCENT update policy; Algorithm 4: POLICY LAPLACE algorithm for DPSU; Algorithm 5: ℓ2-DESCENT update policy; Algorithm 6: POLICY GAUSSIAN algorithm for DPSU (a hedged sketch of this pipeline appears after the table). |
| Open Source Code | Yes | Our code is made available at https://github.com/heyyjudes/differentially-private-set-union |
| Open Datasets | No | Our dataset is collected from the subreddit r/AskReddit. We take a sample of 15,000 posts from each month between January 2017 and December 2018. We filter out duplicate entries, removed posts, and deleted authors. For text preprocessing, we remove URLs and symbols, lowercase all words, and tokenize using nltk.word_tokenize. After preprocessing, we again filter out empty posts to arrive at a dataset of 373,983 posts from 223,388 users. The paper describes how the dataset was collected but does not provide a direct link, DOI, or formal citation for accessing the processed dataset used in the experiments (an illustrative preprocessing sketch follows the table). |
| Dataset Splits | No | The paper does not explicitly mention training, validation, or test dataset splits. It describes using a Reddit dataset and averaging results across 5 different shuffles of user ordering, which is a different concept than a typical dataset split for model training and evaluation. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, cloud instances) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions 'tokenize using nltk' but does not specify the version of NLTK or any other software dependencies with version numbers. |
| Experiment Setup | Yes | The privacy parameters are ε = 3 and δ = exp(−10). For each algorithm, we average the results of 5 different shuffles of user ordering and also include the standard deviation in Table 1. The cutoff Γ is calculated using α = 5; from our experiments, choosing α ∈ [2, 6] works best empirically. The parameters λ, ρ_Lap are set so as to achieve (ε, δ)-DP as shown in Theorem 3.1. The parameters σ, ρ_Gauss are set so as to achieve (ε, δ)-DP as shown in Theorem 4.1. These parameter values reappear in the DPSU sketch after the table. |
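
The preprocessing described in the Open Datasets row can be illustrated with a short sketch. This is an assumption-laden reconstruction, not the authors' released code: it uses NLTK's `word_tokenize` (version unspecified in the paper), and the URL/symbol regexes and the `(author, text)` record shape are illustrative choices.

```python
import re
from nltk.tokenize import word_tokenize  # NLTK version not pinned in the paper

# Assumed patterns for the "remove URLs and symbols" step described in the paper.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
SYMBOL_RE = re.compile(r"[^a-z0-9\s]")


def preprocess_post(text):
    """Lowercase a post, strip URLs and symbols, and tokenize it."""
    text = URL_RE.sub(" ", text.lower())
    text = SYMBOL_RE.sub(" ", text)
    return word_tokenize(text)


def build_user_sets(posts):
    """Map each author to the set of tokens appearing in their posts.

    posts: iterable of (author, text) pairs; posts that become empty after
    preprocessing are dropped, mirroring the filtering described in the paper.
    """
    user_items = {}
    for author, text in posts:
        tokens = preprocess_post(text)
        if not tokens:
            continue
        user_items.setdefault(author, set()).update(tokens)
    return user_items
```

The per-user token sets produced here are the input to the DP set union sketch below.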
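
The Pseudocode and Experiment Setup rows can be tied together with a minimal sketch of the Laplace-noise variant of DP set union: each user spreads a bounded weight budget over a weighted histogram, Laplace noise is added, and items above a threshold are released. The reported ε = 3, δ = exp(−10), and α = 5 are used, but the noise scale `lam`, threshold `rho`, cutoff `gamma = rho + alpha * lam`, and the greedy weight-update rule are simplified placeholders, not the exact update policies or the (ε, δ)-DP calibrations of Theorems 3.1 and 4.1.

```python
import math
import numpy as np

# Reported experimental parameters (Experiment Setup row).
EPSILON = 3.0
DELTA = math.exp(-10)
ALPHA = 5.0

# Placeholder calibration; the paper's Theorems 3.1/4.1 give the values that
# actually guarantee (eps, delta)-DP. These stand-ins only make the sketch run.
LAM = 1.0 / EPSILON
RHO = 1.0 + LAM * math.log(1.0 / (2.0 * DELTA))
GAMMA = RHO + ALPHA * LAM  # cutoff: stop adding weight to items already "safe"


def dp_set_union(user_items, lam=LAM, rho=RHO, gamma=GAMMA, seed=0):
    """Release an approximate union of per-user item sets under DP (sketch).

    user_items: iterable of per-user item sets (e.g. from build_user_sets).
    """
    rng = np.random.default_rng(seed)
    hist = {}

    # Build the weighted histogram with bounded per-user contribution:
    # each user splits a unit budget equally over their items still below
    # the cutoff (a greedy stand-in for the paper's update policies).
    for items in user_items:
        needy = [w for w in set(items) if hist.get(w, 0.0) < gamma]
        if not needy:
            continue
        share = 1.0 / len(needy)
        for w in needy:
            hist[w] = min(gamma, hist.get(w, 0.0) + share)

    # Add Laplace noise and release items above the threshold.
    released = set()
    for w, weight in hist.items():
        if weight + rng.laplace(0.0, lam) > rho:
            released.add(w)
    return released
```

For example, `dp_set_union(build_user_sets(posts).values())` would release vocabulary items supported by enough users while suppressing rare ones; the paper's ℓ1/ℓ2-descent policies distribute the per-user budget more carefully to enlarge the released set.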