INSPECTRE: Privately Estimating the Unseen

Authors: Jayadev Acharya, Gautam Kamath, Ziteng Sun, Huanyu Zhang

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental Results. We demonstrate the efficacy of our method with experimental evaluations. As a baseline, we compare with the non-private algorithms of (Orlitsky et al., 2016) and (Wu & Yang, 2018). Overall, we find that our algorithms' performance is nearly identical, showing that, in many cases, privacy comes (essentially) for free. We begin with an evaluation on synthetic data. Then, inspired by (Valiant & Valiant, 2013; Orlitsky et al., 2016), we analyze a text corpus consisting of words from Hamlet, in order to estimate the number of unique words which occur. Finally, we investigate name frequencies in the US census data. and We evaluated our methods for entropy estimation and support coverage on both synthetic and real data. Overall, we found that privacy is quite cheap: private estimators achieve accuracy which is comparable or near-indistinguishable to non-private estimators in many settings.
Researcher Affiliation | Academia | ECE, Cornell University, Ithaca, New York, USA; EECS & CSAIL, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
Pseudocode | No | The paper describes its algorithms and framework in textual paragraphs but does not include any formally structured pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | Code of our implementation is available at https://github.com/HuanyuZhang/INSPECTRE.
Open Datasets | Yes | Our investigation on US Census data is also inspired by the fact that this is a setting where privacy is of practical importance, evidenced by the proposed adoption of differential privacy in the 2020 US Census (Dajani et al., 2017). and The Hamlet dataset has m_total = 31,999 words, of which 4,804 are distinct. Since the distribution is not as oversampled as the Census data, we do not need to subsample the data.
Dataset Splits | No | The paper describes how samples are generated or drawn from datasets (e.g., 'generate n = k/2 samples', 'sample n of the m_total individuals'), but it does not specify explicit train/validation/test splits with percentages or counts, or refer to standard predefined splits.
Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions using implementations based on code from other authors, but it does not provide specific software dependencies with version numbers (e.g., Python 3.x, specific library versions).
Experiment Setup | Yes | We choose L = 1.2 log k from this, and use it for all other experiments. and We instead set r = (1/(2t)) log_e(n(t+1)^2 / (t − 1)), as previously done in (Orlitsky et al., 2016). and The RMSE is averaged over 100 iterations in the plots. and We compare the performance of SGT, and privatized versions of SGT with parameters ε = 1, 2, and 10. and The RMSE over 100 iterations of this process.
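For context on the Hamlet experiment quoted above, the reported quantities (m_total = 31,999 total words, 4,804 distinct) come from tokenizing the text and counting word multiplicities; support-coverage estimators such as SGT then operate on the "fingerprint" phi_i, the number of distinct words appearing exactly i times. The sketch below shows that preprocessing step only; the function name and the tokenization regex are illustrative choices, not taken from the paper's code.

```python
import re
from collections import Counter

def word_fingerprint(text):
    """Tokenize text and return (total words, distinct words, fingerprint),
    where fingerprint[i] = number of distinct words occurring exactly i times.
    Support-coverage estimators like SGT consume this fingerprint."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)                 # word -> multiplicity
    fingerprint = Counter(counts.values())  # multiplicity -> how many words have it
    return len(words), len(counts), dict(fingerprint)

# Toy usage: "to" and "be" each occur twice, "or" and "not" once.
total, distinct, phi = word_fingerprint("to be or not to be")
```

On the full Hamlet corpus this preprocessing would yield the paper's m_total and distinct-word counts (modulo the exact tokenization rules, which the paper does not specify).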
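The evaluation protocol quoted in the table (privatize an estimator, then average RMSE over 100 iterations) can be sketched generically. This is a minimal illustration, not the paper's algorithm: it uses a plain plug-in entropy estimator privatized with the Laplace mechanism, a rough O(log n / n) sensitivity bound, and a uniform source distribution; all function names and parameter choices here are assumptions for the sketch.

```python
import numpy as np

def plugin_entropy(samples, k):
    """Empirical (plug-in) entropy estimate in nats."""
    counts = np.bincount(samples, minlength=k)
    p = counts / counts.sum()
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def privatize(estimate, sensitivity, eps, rng):
    """Generic Laplace mechanism: add noise with scale sensitivity / eps
    to make the released estimate eps-differentially private."""
    return estimate + rng.laplace(scale=sensitivity / eps)

def rmse_over_trials(k=1000, n=500, eps=1.0, trials=100, seed=0):
    """Average RMSE of the privatized estimate over repeated trials,
    mirroring the 'RMSE averaged over 100 iterations' protocol."""
    rng = np.random.default_rng(seed)
    true_h = np.log(k)  # entropy of the uniform distribution on k symbols
    sensitivity = np.log(n) / n  # rough sensitivity bound (assumption)
    sq_errs = []
    for _ in range(trials):
        samples = rng.integers(0, k, size=n)
        est = privatize(plugin_entropy(samples, k), sensitivity, eps, rng)
        sq_errs.append((est - true_h) ** 2)
    return float(np.sqrt(np.mean(sq_errs)))
```

Sweeping eps over values like 1, 2, and 10, as in the quoted setup, and comparing against the non-private estimate (sensitivity noise omitted) is how one would reproduce the "privacy comes essentially for free" comparison in spirit.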