ViP: A Differentially Private Foundation Model for Computer Vision

Authors: Yaodong Yu, Maziar Sanjabi, Yi Ma, Kamalika Chaudhuri, Chuan Guo

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement this training recipe on the LAION400M dataset (Schuhmann et al., 2021). We show that the resulting model, which we call ViP (Vision transformer with differential Privacy), learns highly useful and transferable representations rivaling those learned by SimCLR on ImageNet, while providing a strong DP guarantee with ϵ = 8. In Figure 1, we compare ViP with other private and non-private models in terms of downstream linear probing accuracy and fine-tuning accuracy on different image datasets.
Researcher Affiliation | Collaboration | 1UC Berkeley, 2Meta. Correspondence to: Yaodong Yu <yyu@eecs.berkeley.edu>, Chuan Guo <chuanguo@meta.com>.
Pseudocode | No | No pseudocode or algorithm blocks are explicitly provided or labeled in the paper; the method is described in descriptive text and diagrams.
Open Source Code | Yes | Code and DP pre-trained models are available at https://github.com/facebookresearch/ViP-MAE.
Open Datasets | Yes | We use 1.05 million samples generated using the Shaders21k (Baradad et al., 2022) tool as our synthetic pre-training dataset, and LAION400M (Schuhmann et al., 2021) as our private pre-training dataset for the ViP model. We evaluate ViP and baseline models via non-private linear probing and fine-tuning on the following downstream classification datasets: ImageNet-1K (Deng et al., 2009), Places-365 and Places-205 (Zhou et al., 2014), iNaturalist-2021 (Van Horn et al., 2021), CIFAR-100 (Krizhevsky et al., 2009), Caltech101 (Fei-Fei et al., 2006), and Aircraft (Maji et al., 2013).
Dataset Splits | No | The paper describes using various datasets such as ImageNet-1K and Places-365, and details how linear probing and few-shot fine-tuning are performed using 'training samples' or 'K training samples from each class'. However, it does not explicitly specify a validation split or describe how validation was performed during ViP training (e.g., for hyperparameter tuning or early stopping); while the standard datasets have conventional splits, the paper does not state its own use of a validation set for its training process.
Hardware Specification | Yes | The process of executing each iteration of DP-AdamW for training the ViP-Base model takes approximately 25 seconds when utilizing 48 A100 (40GB) GPUs. Each epoch of the (Syn)-ViP-Base model's training process takes roughly 90 seconds to complete with 48 A100 (40GB) GPUs. [...] we consider the case with NVIDIA V100 32GB GPUs. [...] For differentially private training, we used a batch size of 81,920 distributed across 128 GPUs. (The per-GPU batch implied by this setup is worked out in the first sketch below the table.)
Software Dependencies | No | The paper states: 'Our implementation uses PyTorch, along with the functorch package (Horace He, 2021) for computation of per-sample gradients and the opacus package (Yousefpour et al., 2021) for privacy accounting.' It also mentions the Detectron2 package (Wu et al., 2019). However, it does not specify explicit version numbers for these software packages. (A per-sample-gradient sketch using functorch appears below the table.)
Experiment Setup | Yes | For MAE training, we set the masking ratio to 75%. In terms of DP training, we set ϵ = 8.0 and δ = 1/(2n) by default for training the (ϵ, δ)-DP model. We set the clipping parameter C = 0.1, sampling ratio q = 98304/n, noise parameter σ = 0.48, and the total number of iterations T = 6100. [...] For linear probing, we use BatchNorm [...] and the LARS [...] optimizer. We choose the base learning rate blr ∈ {0.1, 0.05, 0.01}, batch size B = 16,384, and weight decay λ = 0.0. [...] For vision transformer based architectures, we apply the AdamW optimizer with learning rate lr ∈ {3×10⁻³, 3×10⁻⁴, 3×10⁻⁵} and set weight decay to 0.05. (A privacy-accounting sketch for these DP parameters appears below the table.)
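The hardware row quotes a global batch size of 81,920 spread over 128 GPUs. A minimal worked example, assuming uniform sharding and no gradient accumulation (neither is stated in the excerpt), gives the per-GPU micro-batch:

```python
# Per-GPU micro-batch implied by the quoted distributed DP setup.
# Assumption: uniform sharding across GPUs, no gradient accumulation.
global_batch = 81_920
num_gpus = 128
per_gpu_batch = global_batch // num_gpus
print(per_gpu_batch)  # 640 samples per GPU per DP step
```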
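The software-dependencies row cites functorch for per-sample gradients, which DP training needs in order to clip each example's gradient before noising. The following is a minimal sketch of that pattern, not the authors' code, using a hypothetical stand-in linear model and labeled toy data in place of the ViP MAE encoder and its reconstruction loss:

```python
import torch
import torch.nn as nn
from functorch import make_functional, vmap, grad

# Hypothetical stand-in model; the paper trains a ViT-based MAE instead.
model = nn.Linear(16, 2)
fmodel, params = make_functional(model)

def loss_fn(params, x, y):
    # Loss for a single example (batch dimension added back for the forward pass).
    logits = fmodel(params, x.unsqueeze(0))
    return nn.functional.cross_entropy(logits, y.unsqueeze(0))

# vmap over the batch dimension yields one gradient per sample; DP-SGD/DP-AdamW
# then clips each of these to norm C and adds Gaussian noise before averaging.
xs = torch.randn(8, 16)
ys = torch.randint(0, 2, (8,))
per_sample_grads = vmap(grad(loss_fn), in_dims=(None, 0, 0))(params, xs, ys)
print([g.shape for g in per_sample_grads])  # each has a leading batch dimension of 8
```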
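The experiment-setup row gives σ = 0.48, q = 98304/n, T = 6100, and δ = 1/(2n) for the ϵ = 8 guarantee. A rough sanity-check sketch using the opacus RDP accountant named in the dependencies; the dataset size n below is an assumed placeholder, not taken from this excerpt:

```python
from opacus.accountants import RDPAccountant

n = 233_000_000   # assumed pre-training set size (placeholder, not from the excerpt)
q = 98_304 / n    # sampling ratio from the quoted setup
sigma = 0.48      # noise multiplier
T = 6_100         # total DP iterations

accountant = RDPAccountant()
for _ in range(T):
    # Consecutive identical steps are merged internally, so this loop is cheap.
    accountant.step(noise_multiplier=sigma, sample_rate=q)

# Prints the epsilon implied by these settings at delta = 1/(2n),
# to compare against the paper's reported epsilon = 8.
print(accountant.get_epsilon(delta=1 / (2 * n)))
```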