Differentially Private Representation Learning via Image Captioning

Authors: Tom Sander, Yaodong Yu, Maziar Sanjabi, Alain Oliviero Durmus, Yi Ma, Kamalika Chaudhuri, Chuan Guo

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a series of engineering tricks, we successfully train a DP image captioner (DP-Cap) on a 233M subset of LAION-2B from scratch using a reasonable amount of computation, obtaining unprecedented high-quality image features that can be used in a variety of downstream vision and vision-language tasks. For example, under a privacy budget of ε = 8 for the LAION dataset, a linear classifier trained on top of learned DP-Cap features attains 65.8% accuracy on ImageNet-1K, considerably improving the previous SOTA of 56.5%.
Researcher Affiliation | Collaboration | Meta; CMAP, École polytechnique; UC Berkeley; UCSD.
Pseudocode | No | The paper describes the methods in prose and uses mathematical equations for DP and loss functions, but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/facebookresearch/dpcap.
Open Datasets | Yes | Following the approach introduced by Yu et al. (2023), we first pre-train on the Shaders21k dataset (Baradad et al., 2022) of synthetic images. We then train with DP-SGD on a subset comprising 233 million deduplicated (using SemDeDup (Abbas et al., 2023)), NSFW-filtered and face-blurred (using an approach similar to Yang et al. (2021)) image-caption pairs from the (English-only) LAION-2B dataset (Schuhmann et al., 2022).
Dataset Splits | No | We use the ImageNet-1K (Deng et al., 2009a; Russakovsky et al., 2014), CIFAR-10/100 (Krizhevsky et al., 2009), Places-365/205 (Zhou et al., 2014) and iNaturalist-2021 (Van Horn et al., 2021) image classification datasets to assess the performance of learned image representations via full linear probing, few-shot linear probing, and zero-shot prediction. (See the linear-probing sketch below the table.)
Hardware Specification | Yes | A naive implementation of DP-Cap with per-sample gradient computation using functorch would take approximately 61 days(!) on 128 NVIDIA V100 GPUs. The number of GPU hours is estimated on a single NVIDIA V100 GPU with 32GB memory using 100K samples. (See the per-sample gradient sketch below the table.)
Software Dependencies | No | We use RDP accounting with subsampling from the Opacus library (Yousefpour et al., 2021). Combining these two techniques with DP-SGD requires careful consideration to ensure both a correct DP guarantee and numerical stability. We detail the implementation in Appendix A.2. (See the accounting sketch below the table.)
Experiment Setup | Yes | Our choice of gradient clipping factor is C = 1, as we did not observe any performance improvement with other values. We always use AdamW (Loshchilov & Hutter, 2018) for training. We use a learning rate of 5.12 × 10⁻⁴. ... We use a maximum length of 40 tokens to process the LAION captions. We use a linear schedule, with 40% of warm-up iterations, and 2× the entire training as decay horizon. As opposed to what was previously observed (De et al., 2022; Sander et al., 2023), the learning rate schedule played an important role for us with DP-SGD training. We use a weight decay of 0.05. For our ε = 8 models, we limited training to 32 epochs, a process that took 5 days utilizing 128 V100 GPUs for the Base model. To fit the privacy budget while utilizing a batch size of 1.3 million and training for 32 epochs, RDP analysis yields σ = 0.728. (See the schedule sketch below the table.)
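
For context on the probing protocols named in the Dataset Splits row, the sketch below shows a generic full or few-shot linear probe on frozen features. The function name, the use of scikit-learn's LogisticRegression, and the per-class subsampling are illustrative assumptions, not the paper's exact evaluation pipeline.

```python
# Minimal linear-probing sketch on frozen features (illustrative only; the
# paper's actual probing setup and hyperparameters may differ).
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels, shots=None):
    """Fit a linear classifier on frozen features and return test accuracy.
    `shots` limits the number of examples per class for few-shot probing
    (assumption: simple per-class subsampling)."""
    if shots is not None:
        idx = np.concatenate([
            np.flatnonzero(train_labels == c)[:shots]
            for c in np.unique(train_labels)
        ])
        train_feats, train_labels = train_feats[idx], train_labels[idx]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)  # top-1 accuracy
```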
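
The 61-day estimate in the Hardware Specification row refers to naive per-sample gradient computation for DP-SGD. The sketch below shows what that naive approach looks like with torch.func (the successor to functorch), using a stand-in linear model rather than the actual captioner; it is not the authors' optimized implementation.

```python
# Naive per-sample gradients via vmap(grad(...)) in torch.func (formerly
# functorch). Illustrative only: DP-Cap relies on a far more efficient
# implementation than this.
import torch
import torch.nn as nn
from torch.func import functional_call, grad, vmap

model = nn.Linear(512, 10)  # stand-in for the image captioner
params = {k: v.detach() for k, v in model.named_parameters()}

def per_example_loss(params, x, y):
    # Evaluate the model functionally on a single example.
    logits = functional_call(model, params, (x.unsqueeze(0),))
    return nn.functional.cross_entropy(logits, y.unsqueeze(0))

# One gradient per example: every tensor in the result has a leading batch dim.
per_sample_grads = vmap(grad(per_example_loss), in_dims=(None, 0, 0))(
    params, torch.randn(8, 512), torch.randint(0, 10, (8,))
)
```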
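
The Opacus accounting mentioned in the Software Dependencies row can be reproduced at a high level with the library's RDPAccountant. The δ value and the exact step count below are assumptions for illustration; the paper's accounting procedure (detailed in its Appendix A.2) may differ.

```python
# Rough RDP accounting with Opacus (Yousefpour et al., 2021). Illustrative only:
# delta and the step count are assumptions, not values taken from the paper.
from opacus.accountants import RDPAccountant

DATASET_SIZE = 233_000_000   # 233M LAION subset
BATCH_SIZE = 1_300_000       # 1.3M effective batch size
EPOCHS = 32
SIGMA = 0.728                # noise multiplier reported in the paper

sample_rate = BATCH_SIZE / DATASET_SIZE
steps = int(EPOCHS / sample_rate)

accountant = RDPAccountant()
for _ in range(steps):
    accountant.step(noise_multiplier=SIGMA, sample_rate=sample_rate)

# Convert the accumulated RDP guarantee to (epsilon, delta)-DP.
print(accountant.get_epsilon(delta=1e-8))
```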
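
Since the Experiment Setup row stresses the importance of the learning-rate schedule, here is a sketch of one plausible reading of "linear schedule with 40% warm-up iterations and 2× the entire training as decay horizon". Whether the decay horizon is measured from the end of warm-up or from step 0 is an assumption; the paper does not spell this out in the quoted passage.

```python
# Linear warm-up / linear decay schedule as described: 40% of iterations are
# warm-up, and the decay horizon is twice the total number of training
# iterations, so the learning rate never decays to zero during training.
PEAK_LR = 5.12e-4  # peak learning rate reported in the paper

def lr_at(step: int, total_steps: int, peak_lr: float = PEAK_LR) -> float:
    warmup_steps = int(0.4 * total_steps)
    decay_horizon = 2 * total_steps
    if step < warmup_steps:
        # Linear ramp from 0 to the peak learning rate.
        return peak_lr * step / max(warmup_steps, 1)
    # Linear decay toward zero over the (longer) decay horizon.
    return peak_lr * (1 - (step - warmup_steps) / (decay_horizon - warmup_steps))
```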