Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining

Authors: Florian Tramèr, Gautam Kamath, Nicholas Carlini

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This work is a position paper, which takes a critical view of the current state of the field and highlights several aspects we find problematic. We thus put forward a call for solutions from the community. While we offer some broad suggestions on potential ways to address our concerns, we (intentionally) stop short of technically exploring solutions, as each of these challenges deserves significant attention beyond the scope of this article.
Researcher Affiliation | Collaboration | (1) Department of Computer Science, ETH Zürich, Zürich, Switzerland; (2) Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada; (3) Vector Institute, Toronto, Ontario, Canada; (4) Google DeepMind, Mountain View, USA.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper is a position paper that does not present a new method or implementation, and therefore does not provide open-source code.
Open Datasets | Yes | For example, on the ImageNet dataset (Deng et al., 2009), in the absence of any pretraining, the approach of Sander et al. (2023) achieves a top-1 accuracy of 39.2% under a fairly weak provable DP guarantee of ε = 8; more recent work of Tang et al. (2023) improved this slightly to the current state of the art: 39.39%. This represents an almost 6× increase in error rate compared to the best non-private model trained solely on ImageNet (at least 86.7% accuracy) (Tu et al., 2022). In contrast, when leveraging a dataset of 4 billion Web images for public pretraining, Berrada et al. (2023) achieve an accuracy of 86.8% at a much more reasonable privacy budget of ε = 1 (with comparable results obtained by De et al. (2022); Mehta et al. (2023)); a minimal sketch of this public-pretraining-plus-private-fine-tuning setup appears after this table. We thus believe it is necessary for the private learning community to begin considering and curating new benchmarks to properly disentangle advances in non-private representation learning from advances in privacy-preserving learning. Such benchmarks could include existing sensitive datasets that have been released for research purposes, e.g., medical datasets (Johnson et al., 2016; Irvin et al., 2019; Wang et al., 2017; Bejnordi et al., 2017), email corpora (Klimt & Yang, 2004), user reviews (Bennett & Lanning, 2007), etc.
Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages, counts, or predefined splits) for its own work, as it is a position paper and does not conduct experiments.
Hardware Specification | No | The paper does not provide specific hardware details used for running its own experiments, as it is a position paper that does not conduct experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers, as it is a position paper that does not conduct experiments.
Experiment Setup | No | The paper does not provide specific experimental setup details or hyperparameter values for its own work, as it is a position paper that does not conduct experiments.
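
The Open Datasets row above contrasts training an ImageNet classifier from scratch under DP with privately fine-tuning a publicly pretrained backbone. A minimal sketch of that second setup is given below, assuming PyTorch, torchvision, and Opacus; the network choice (ResNet-18), data loader, and the (ε, δ) budget are illustrative placeholders and are not the configurations used by Sander et al. (2023), Berrada et al. (2023), or the other works cited in the paper.

```python
# Illustrative sketch: "public pretraining + DP-SGD private fine-tuning".
# Assumes PyTorch, torchvision (>= 0.13 for the weights API), and Opacus.
# All hyperparameters below are placeholders, not values from the paper.
import torch
from torch import nn, optim
from torchvision import models
from opacus import PrivacyEngine
from opacus.validators import ModuleValidator


def build_private_finetuner(train_loader, num_classes, target_epsilon=1.0,
                            target_delta=1e-5, epochs=10, max_grad_norm=1.0):
    # Start from weights pretrained on a public dataset (ImageNet weights here).
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Replace the classification head for the (private) downstream task.
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    # DP-SGD needs per-sample gradients; BatchNorm mixes samples within a batch,
    # so Opacus rewrites unsupported layers (e.g., BatchNorm -> GroupNorm).
    model = ModuleValidator.fix(model)

    # Build the optimizer only after the module has been fixed.
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Attach the privacy engine and calibrate the noise multiplier so that the
    # stated number of epochs stays within the target (epsilon, delta) budget.
    privacy_engine = PrivacyEngine()
    model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,
        target_epsilon=target_epsilon,
        target_delta=target_delta,
        epochs=epochs,
        max_grad_norm=max_grad_norm,  # per-sample gradient clipping bound
    )
    # After training, privacy_engine.get_epsilon(target_delta) reports the
    # privacy budget actually spent.
    return model, optimizer, train_loader, privacy_engine
```

The sketch only illustrates the mechanics of the setup the paper critiques: the representation quality comes almost entirely from the public pretraining step, while the DP guarantee (ε = 1 in the placeholder above) covers only the fine-tuning data.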