Differentially Private Set Union

Authors: Sivakanth Gopi, Pankaj Gulhane, Janardhan Kulkarni, Judy Hanwen Shen, Milad Shokouhi, Sergey Yekhanin

ICML 2020

Reproducibility assessment (each item lists the variable, the result, and the LLM response):
Research Type: Experimental. Using a Reddit dataset, we demonstrate that our algorithms significantly improve the size of the DP set union even when compared to natural generalizations of the existing mechanisms for this problem (see Figure 1). Our algorithms are being productized in industry to make a basic subroutine in an NLP application differentially private. While the algorithms we describe generalize to many domains that involve the release of a set union, our experiments use a natural language dataset.
Researcher Affiliation: Industry. Sivakanth Gopi¹, Pankaj Gulhane¹, Janardhan Kulkarni¹, Judy Hanwen Shen¹ ², Milad Shokouhi¹, Sergey Yekhanin¹. ¹Microsoft. ²Work done as part of the Microsoft AI Residency Program. Correspondence to: Sivakanth Gopi <sigopi@microsoft.com>, Janardhan Kulkarni <jakul@microsoft.com>, Judy Hanwen Shen <hashe@microsoft.com>.
Pseudocode: Yes. Algorithm 1: high-level meta-algorithm for DP Set Union; Algorithm 2: high-level meta-algorithm for building a weighted histogram using a given update policy; Algorithm 3: ℓ1-DESCENT update policy; Algorithm 4: POLICY LAPLACE algorithm for DPSU; Algorithm 5: ℓ2-DESCENT update policy; Algorithm 6: POLICY GAUSSIAN algorithm for DPSU.
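These algorithms share one pattern: an update policy builds a weighted histogram in which each user contributes bounded weight to at most Γ items, and the mechanism then releases every item whose noised weight clears a threshold ρ. The sketch below is a minimal illustration of that meta-algorithm with a naive uniform-weight policy and Laplace noise; the function name, the uniform policy, and the way eps and rho enter are illustrative assumptions, not the paper's POLICY LAPLACE (Algorithms 3-4 choose weights adaptively and set ρ via Theorem 3.1).

```python
import numpy as np
from collections import defaultdict

def dp_set_union(users_items, gamma, eps, rho, seed=0):
    """Minimal sketch of the DPSU meta-algorithm (cf. Algorithms 1-2).

    users_items: iterable of per-user item collections (e.g., token sets).
    gamma:       cutoff on how many items a single user may contribute.
    eps:         Laplace noise parameter.
    rho:         release threshold; the paper picks it (Theorem 3.1) so the
                 overall mechanism is (eps, delta)-DP.
    """
    rng = np.random.default_rng(seed)
    hist = defaultdict(float)
    for items in users_items:
        items = sorted(set(items))[:gamma]   # each user adds at most gamma items
        if not items:
            continue
        # Naive uniform policy: split a unit weight budget evenly.  The paper's
        # l1-DESCENT / l2-DESCENT policies instead steer each user's budget
        # toward items that are still below the release threshold.
        w = 1.0 / len(items)
        for item in items:
            hist[item] += w
    # Noise each accumulated weight and keep items that clear the threshold.
    return {item for item, wt in hist.items()
            if wt + rng.laplace(scale=1.0 / eps) > rho}
```

In this sketch each user adds total weight at most 1, so removing a user changes the histogram by at most 1 in ℓ1 norm; that bounded sensitivity is what lets Laplace noise of scale 1/ε plus thresholding yield (ε, δ)-DP.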
Open Source Code: Yes. Our code is made available at https://github.com/heyyjudes/differentially-private-set-union.
Open Datasets: No. Our dataset is collected from the subreddit r/AskReddit. We take a sample of 15,000 posts from each month between January 2017 and December 2018. We filter out duplicate entries, removed posts, and deleted authors. For text preprocessing, we remove URLs and symbols, lowercase all words, and tokenize using nltk.word_tokenize. After preprocessing, we again filter out empty posts to arrive at a dataset of 373,983 posts from 223,388 users. The paper describes how the dataset was collected but does not provide a direct link, DOI, or formal citation for accessing the processed dataset used in the experiments.
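Since the processed dataset is not released, a reproduction would have to re-run the described preprocessing. Below is a minimal sketch of that pipeline under stated assumptions: the regular expressions standing in for "remove URLs and symbols" and the helper name preprocess_post are guesses, since the paper gives only a prose description; only nltk.word_tokenize is named in the paper.

```python
import re
import nltk

nltk.download("punkt", quiet=True)   # tokenizer data needed by nltk.word_tokenize

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # assumed pattern for "remove URLs"
SYMBOL_RE = re.compile(r"[^a-z0-9\s]")          # assumed pattern for "remove symbols"

def preprocess_post(text):
    """Lowercase, strip URLs and symbols, then tokenize one Reddit post."""
    text = URL_RE.sub(" ", text.lower())
    text = SYMBOL_RE.sub(" ", text)
    return nltk.word_tokenize(text)

# Posts that are empty after preprocessing are filtered out, as in the paper.
posts = ["Check this out: https://example.com !!!", "???"]
tokenized = [toks for toks in map(preprocess_post, posts) if toks]
```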
Dataset Splits: No. The paper does not explicitly mention training, validation, or test dataset splits. It describes using a Reddit dataset and averaging results across 5 different shuffles of the user ordering, which is different from a typical dataset split for model training and evaluation.
Hardware Specification: No. The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, cloud instances) used to run the experiments.
Software Dependencies: No. The paper mentions tokenizing with NLTK (nltk.word_tokenize) but does not specify the version of NLTK or any other software dependency with version numbers.
Experiment Setup: Yes. The privacy parameters are ε = 3 and δ = exp(−10). For each algorithm, we average the results of 5 different shuffles of the user ordering and also include the standard deviation in Table 1. The cutoff Γ is calculated using α = 5. From our experiments, choosing α ∈ [2, 6] works best empirically. The parameters λ, ρLap are set so as to achieve (ε, δ)-DP as shown in Theorem 3.1. The parameters σ, ρGauss are set so as to achieve (ε, δ)-DP as shown in Theorem 4.1.
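The reported values are enough to sketch how the experimental loop could be wired up. In the snippet below, the threshold formula is a hypothetical stand-in (one common stability-based histogram bound), not the paper's exact ρLap from Theorem 3.1; Γ is a placeholder rather than the paper's α-based calculation; and dp_set_union refers to the meta-algorithm sketch above.

```python
import math
import random

eps = 3.0
delta = math.exp(-10)
gamma = 10   # placeholder: the paper computes the cutoff Gamma from alpha = 5 and the data

# Hypothetical stand-in for Theorem 3.1: one common stability-based histogram
# choice is Laplace scale 1/eps with threshold 1 + ln(1/(2*delta))/eps for
# unit l1 sensitivity.  The paper's exact rho_Lap differs.
rho = 1.0 + math.log(1.0 / (2.0 * delta)) / eps

users_items = [{"hello", "world"}, {"hello", "there"}, {"world", "again"}]  # toy data

# Average the released set size over 5 shuffles of the user ordering,
# mirroring the paper's protocol (ordering matters for policy-based updates).
sizes = []
for seed in range(5):
    order = list(users_items)
    random.Random(seed).shuffle(order)
    sizes.append(len(dp_set_union(order, gamma, eps, rho, seed=seed)))  # sketch above
mean_size = sum(sizes) / len(sizes)
std_size = (sum((s - mean_size) ** 2 for s in sizes) / len(sizes)) ** 0.5
```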