Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On the Efficacy of Differentially Private Few-shot Image Classification
Authors: Marlon Tobaben, Aliaksandra Shysheya, John F Bronskill, Andrew Paverd, Shruti Tople, Santiago Zanella-Beguelin, Richard E Turner, Antti Honkela
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To understand under which conditions few-shot DP can be effective, we perform an exhaustive set of experiments that reveals how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, downstream dataset, and subset of learnable parameters in the model vary. We show that to achieve DP accuracy on par with non-private models, the shots per class must be increased as the privacy level increases. We also show that learning parameter-efficient FiLM adapters under DP is competitive with learning just the final classifier layer or learning all of the network parameters. Finally, we evaluate DP federated learning systems and establish state-of-the-art performance on the challenging FLAIR benchmark. |
| Researcher Affiliation | Collaboration | Marlon Tobaben EMAIL University of Helsinki; Aliaksandra Shysheya EMAIL University of Cambridge; John Bronskill EMAIL University of Cambridge; Andrew Paverd EMAIL Microsoft; Shruti Tople EMAIL Microsoft; Santiago Zanella-Béguelin EMAIL Microsoft; Richard E. Turner EMAIL University of Cambridge; Antti Honkela EMAIL University of Helsinki |
| Pseudocode | No | The paper describes algorithms like DP-SGD and FedADAM, but does not present them in pseudocode or algorithm blocks within the paper. For example, it says: "DP-SGD (Rajkumar & Agarwal, 2012; Song et al., 2013; Abadi et al., 2016) adapts stochastic gradient descent (SGD) to guarantee DP." It refers to external sources for the algorithms. |
| Open Source Code | Yes | Source code for all experiments can be found at: https://github.com/cambridge-mlg/dp-few-shot. |
| Open Datasets | Yes | Datasets For the experiments where S is varied, we use the CIFAR-10 (low TD) and CIFAR-100 (medium TD) datasets (Krizhevsky, 2009) which are commonly used in DP transfer learning, and SVHN (Netzer et al., 2011) which has a high transfer difficulty and hence requires a greater degree of adaptation of the pretrained backbone. We also evaluate on the challenging VTAB-1k transfer learning benchmark (Zhai et al., 2019) that consists of 19 datasets grouped into three distinct categories (natural, specialized, and structured) with training set size fixed at |D| = 1000 and widely varying TD. |
| Dataset Splits | Yes | Training Protocol For all centralized experiments, we first draw D of the required size (|D| = CS (i.e. the number of classes C multiplied by shot S) for varying shot or |D| = 1000 for VTAB-1k) from the entire training split of the current dataset under evaluation. For the purposes of hyperparameter tuning, we then split D into 70% train and 30% validation. We then perform 20 iterations of Bayesian optimization based hyperparameter tuning (Bergstra et al., 2011) with Optuna (Akiba et al., 2019) to derive a set of hyperparameters that yield the highest accuracy on the validation data. This set of parameters is subsequently used to train a final model on all of D. |
| Hardware Specification | Yes | All of the effect of S and ϵ experiments were carried out on 1 (for Head and Fi LM) and up to 3 (for All) NVIDIA V100 GPUs with 32GB of memory. ... All of the VTAB-1k transfer learning experiments were carried out on a single NVIDIA A100 GPU with 80GB of memory. |
| Software Dependencies | No | For DP fine-tuning on D, we use Opacus (Yousefpour et al., 2021) and compute the required noise multiplier depending on the targeted (ϵ, δ). ... All experiments were performed in TensorFlow using tensorflow-federated (Google, 2019a) for federated aggregation and tensorflow-privacy (Google, 2019b) for privacy accounting and the adaptive clipping algorithm (Andrew et al., 2021). ... with Optuna (Akiba et al., 2019). |
| Experiment Setup | Yes | Details on the set of hyperparameters that are tuned and their ranges can be found in Appendix A.3.2. ... Table 19: Hyperparameter ranges used for the Bayesian optimization: epochs 1 to 200; learning rate 1e-7 to 1e-2; batch size 10 to \|D\|; clipping norm 0.2 to 10; noise multiplier based on target ϵ. |
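The Pseudocode row notes that DP-SGD is only described by citation in the paper. As a hedged illustration of the general technique (not the paper's implementation, which uses Opacus), a minimal pure-Python sketch of one DP-SGD step — per-example L2 gradient clipping, aggregation, and Gaussian noise scaled by the noise multiplier times the clipping norm — looks like this:

```python
import math
import random

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, lr, params):
    """One DP-SGD update. Illustrative sketch only: clip each example's
    gradient to L2 norm `clip_norm`, sum, add Gaussian noise with standard
    deviation `noise_multiplier * clip_norm`, average, and step."""
    clipped = []
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))  # clip to L2 ball
        clipped.append([x * scale for x in g])
    n = len(per_example_grads)
    summed = [sum(col) for col in zip(*clipped)]
    noisy = [s + random.gauss(0.0, noise_multiplier * clip_norm) for s in summed]
    return [p - lr * (v / n) for p, v in zip(params, noisy)]
```

In practice the noise multiplier is not chosen directly but derived from the target (ϵ, δ) by a privacy accountant, as the Software Dependencies row describes for Opacus and tensorflow-privacy.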
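The hyperparameter ranges quoted in the Experiment Setup row can be read as a search space for the Bayesian optimization described in the Dataset Splits row. A hypothetical plain-Python sampler over those ranges (the paper uses Optuna; the function name and log-uniform learning-rate draw here are assumptions, not the authors' code) might look like:

```python
import random

def sample_hyperparameters(dataset_size, rng=random):
    """Draw one configuration from the ranges quoted from Table 19.
    `dataset_size` stands in for |D|; the paper fixes it per experiment."""
    return {
        "epochs": rng.randint(1, 200),
        # learning rate drawn log-uniformly over [1e-7, 1e-2] (assumption)
        "learning_rate": 10 ** rng.uniform(-7, -2),
        "batch_size": rng.randint(10, dataset_size),
        "clipping_norm": rng.uniform(0.2, 10.0),
        # noise multiplier is excluded: it is derived from the target (eps, delta)
    }
```

Note that the noise multiplier is deliberately not part of the search space, since it is computed from the privacy target rather than tuned for accuracy.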