Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining
Authors: Florian Tramèr, Gautam Kamath, Nicholas Carlini
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This work is a position paper, which takes a critical view of the current state of the field and highlights several aspects we find problematic. We thus put forward a call for solutions from the community: while we offer some broad suggestions on potential ways to address our concerns, we (intentionally) stop short of technically exploring solutions, as each of these challenges deserves significant attention beyond the scope of this article. |
| Researcher Affiliation | Collaboration | 1) Department of Computer Science, ETH Zürich, Zürich, Switzerland; 2) Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada; 3) Vector Institute, Toronto, Ontario, Canada; 4) Google DeepMind, Mountain View, USA. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper is a position paper that does not present a new method or implementation, and therefore does not provide open-source code. |
| Open Datasets | Yes | For example, on the ImageNet dataset (Deng et al., 2009), in the absence of any pretraining, the approach of Sander et al. (2023) achieves a top-1 accuracy of 39.2%, under a fairly weak provable DP guarantee of ε = 8; more recent work of Tang et al. (2023) improved this slightly to the current state of the art: 39.39%. This represents an almost 6× increase in error rate compared to the best non-private model trained solely on ImageNet (at least 86.7% accuracy) (Tu et al., 2022). In contrast, when leveraging a dataset of 4 billion Web images for public pretraining, Berrada et al. (2023) achieve an accuracy of 86.8% at a much more reasonable privacy budget of ε = 1 (with comparable results obtained by De et al. (2022); Mehta et al. (2023)). We thus believe it is necessary for the private learning community to begin considering and curating new benchmarks to properly disentangle advances in non-private representation learning from advances in privacy-preserving learning. Such benchmarks could include existing sensitive datasets that have been released for research purposes, e.g., medical datasets (Johnson et al., 2016; Irvin et al., 2019; Wang et al., 2017; Bejnordi et al., 2017), email corpora (Klimt & Yang, 2004), user reviews (Bennett & Lanning, 2007), etc. (A minimal sketch of the public-pretraining-then-private-fine-tuning recipe referenced here follows the table.) |
| Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages, counts, or predefined splits) for its own work, as it is a position paper and does not conduct experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its own experiments, as it is a position paper that does not conduct experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers, as it is a position paper that does not conduct experiments. |
| Experiment Setup | No | The paper does not provide specific experimental setup details or hyperparameter values for its own work, as it is a position paper that does not conduct experiments. |
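
The Open Datasets row above contrasts training with differential privacy from scratch against privately fine-tuning a publicly pretrained backbone. Below is a minimal sketch of the latter recipe using PyTorch and Opacus; the ResNet-18 backbone, the random placeholder dataset, and all hyperparameters (target ε = 1, δ = 1e-5, clipping norm, epoch count) are illustrative assumptions and do not reproduce the setups of Berrada et al. (2023), De et al. (2022), or any other work cited in the row.

```python
# Hypothetical sketch: privately fine-tune a publicly pretrained backbone with DP-SGD (Opacus).
# All data, model, and hyperparameter choices are placeholders for illustration only.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models
from opacus import PrivacyEngine
from opacus.validators import ModuleValidator

# "Public pretraining" step: load weights trained on a public corpus (ImageNet, torchvision >= 0.13).
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 10)  # new classification head for the private task

# DP-SGD needs per-sample gradients, which are incompatible with BatchNorm;
# ModuleValidator.fix swaps BatchNorm layers for GroupNorm equivalents.
model = ModuleValidator.fix(model)

# Placeholder "sensitive" dataset; replace with the actual private data of interest.
private_data = TensorDataset(torch.randn(128, 3, 224, 224), torch.randint(0, 10, (128,)))
train_loader = DataLoader(private_data, batch_size=32)

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Attach DP-SGD: Opacus calibrates the noise multiplier so the full run
# satisfies an (epsilon = 1, delta = 1e-5) differential privacy guarantee.
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    target_epsilon=1.0,
    target_delta=1e-5,
    epochs=5,
    max_grad_norm=1.0,  # per-sample gradient clipping bound
)

model.train()
for epoch in range(5):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()   # Opacus clips per-sample gradients and adds Gaussian noise
        optimizer.step()

print(f"Privacy budget spent: epsilon = {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```

The sketch exists only to make the row's comparison concrete: the privacy accounting (the reported ε) covers the fine-tuning on the private data, while the pretraining on public Web-scale data incurs no formal privacy cost, which is precisely the accounting convention the position paper scrutinizes.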