INSPECTRE: Privately Estimating the Unseen
Authors: Jayadev Acharya, Gautam Kamath, Ziteng Sun, Huanyu Zhang
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental Results. We demonstrate the efficacy of our method with experimental evaluations. As a baseline, we compare with the non-private algorithms of (Orlitsky et al., 2016) and (Wu & Yang, 2018). Overall, we find that our algorithms' performance is nearly identical, showing that, in many cases, privacy comes (essentially) for free. We begin with an evaluation on synthetic data. Then, inspired by (Valiant & Valiant, 2013; Orlitsky et al., 2016), we analyze a text corpus consisting of words from Hamlet, in order to estimate the number of unique words which occur. Finally, we investigate name frequencies in the US census data. and We evaluated our methods for entropy estimation and support coverage on both synthetic and real data. Overall, we found that privacy is quite cheap: private estimators achieve accuracy which is comparable or near-indistinguishable to non-private estimators in many settings. |
| Researcher Affiliation | Academia | ECE, Cornell University, Ithaca, New York, USA; EECS & CSAIL, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA. |
| Pseudocode | No | The paper describes its algorithms and framework in textual paragraphs but does not include any formally structured pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | Code of our implementation is available at https://github.com/HuanyuZhang/INSPECTRE. |
| Open Datasets | Yes | Our investigation on US Census data is also inspired by the fact that this is a setting where privacy is of practical importance, evidenced by the proposed adoption of differential privacy in the 2020 US Census (Dajani et al., 2017). and The Hamlet dataset has m_total = 31,999 words, of which 4804 are distinct. Since the distribution is not as oversampled as the Census data, we do not need to subsample the data. |
| Dataset Splits | No | The paper describes how samples are generated or drawn from datasets (e.g., 'generate n = k/2 samples', 'sample n of the mtotal individuals'), but it does not specify explicit train/validation/test splits with percentages or counts, or refer to standard predefined splits. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using implementations based on code from other authors, but it does not provide specific software dependencies with version numbers (e.g., Python 3.x, specific library versions). |
| Experiment Setup | Yes | We choose L = 1.2 log k from this, and use it for all other experiments. and We instead set r = (1/(2t)) log_e(n(t+1)^2/(t−1)), as previously done in (Orlitsky et al., 2016). and The RMSE is averaged over 100 iterations in the plots. and We compare the performance of SGT, and privatized versions of SGT with parameters ε = 1, 2, and 10. and The RMSE over 100 iterations of this process. |
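The experiment-setup cell quotes two concrete parameter choices (the histogram cutoff L = 1.2 log k and the SGT smoothing parameter r) and the RMSE-over-iterations metric. A minimal Python sketch of those formulas follows; the function names are my own, and the use of the natural logarithm in both formulas is an assumption, not confirmed by the paper.

```python
import math

def histogram_cutoff(k):
    """Cutoff quoted in the paper: L = 1.2 log k.
    Assumes the natural log (the paper does not state the base)."""
    return 1.2 * math.log(k)

def sgt_smoothing_r(n, t):
    """Smoothing parameter quoted from the paper (after Orlitsky et al., 2016):
    r = (1/(2t)) * log_e( n (t+1)^2 / (t - 1) ).  Requires t > 1."""
    return (1.0 / (2.0 * t)) * math.log(n * (t + 1) ** 2 / (t - 1))

def rmse(estimates, truth):
    """Root-mean-squared error of repeated estimates against the true value,
    matching the 'RMSE averaged over 100 iterations' description."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))
```

For example, with n = 100 samples and extrapolation factor t = 2, `sgt_smoothing_r(100, 2)` evaluates (1/4)·ln(900); in the paper's plots, `rmse` would be applied to 100 such repeated estimates per setting.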