Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On the Convergence and Calibration of Deep Learning with Differential Privacy

Authors: Zhiqi Bu, Hua Wang, Zongyu Dai, Qi Long

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate via numerous experiments that a small clipping norm generally leads to more accurate but less calibrated DP models, whereas a large clipping norm effectively mitigates the calibration issue, preserves a similar accuracy, and provides the same privacy guarantee. We conduct the first experiments on DP and calibration with large models at the Transformer level.
Researcher Affiliation | Academia | Zhiqi Bu EMAIL University of Pennsylvania; Hua Wang EMAIL University of Pennsylvania; Zongyu Dai EMAIL University of Pennsylvania; Qi Long EMAIL University of Pennsylvania
Pseudocode | No | The paper describes methods using mathematical equations (e.g., Equation 2.2, Equation 4.1) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code can be found at https://github.com/woodyx218/opacus_global_clipping.
Open Datasets | Yes | CIFAR10 is an image dataset, which contains 50000 training samples and 10000 test samples of 32×32 color images in 10 classes. (Section 5.3) On the MNIST dataset, which contains 60000 training samples and 10000 test samples of 28×28 grayscale images in 10 classes... (Section 5.4) Stanford Natural Language Inference (SNLI) is a collection of human-written English sentence pairs labeled with one of three classes: entailment, contradiction, or neutral. The dataset has 550152 training samples and 10000 test samples... We use SNLI 1.0 from https://nlp.stanford.edu/projects/snli/ (Section 5.5) We experiment on the California Housing data (20640 samples, 8 features) and Wine Quality (1599 samples, 11 features...)... http://archive.ics.uci.edu/ml/datasets/Wine+Quality http://lib.stat.cmu.edu/datasets/houses.zip (Section 5.6 & Footnotes)
Dataset Splits | Yes | CIFAR10 is an image dataset, which contains 50000 training samples and 10000 test samples of 32×32 color images in 10 classes. (Section 5.3) On the MNIST dataset, which contains 60000 training samples and 10000 test samples of 28×28 grayscale images in 10 classes... (Section 5.4) The dataset has 550152 training samples and 10000 test samples. (Section 5.5) We experiment on the Wine Quality (1279 training samples, 320 test samples, 11 features) and California Housing (18576 training samples, 2064 test samples, 8 features) datasets in Section 5.2. (Appendix C.4)
Hardware Specification | Yes | Throughout this paper, we use the GDP privacy accountant for the experiments, with the Private Vision library (Bu et al.) (improved on Opacus) and one P100 GPU.
Software Dependencies | Yes | Building on top of the PyTorch Opacus library... In this formulation, we can easily implement our global clipping by leveraging the Opacus==0.15 library (which already computes Ci).
Experiment Setup | Yes | For fixed R = 1, η = 0.1, ViT-base trained with DP-SGD under various noise σ has similar performance on CIFAR10 (setting in Section 5.3). (Figure 1 caption) CIFAR10 is an image dataset, which contains 50000 training samples and 10000 test samples of 32×32 color images in 10 classes. We use the Vision Transformer (ViT-base, 86 million parameters), which is pre-trained on ImageNet, and train with DP-SGD for a single epoch. (Section 5.3) On the MNIST dataset... train with DP-SGD... batch size 256, noise scale 1.1, learning rate 0.15/R for each R. (Section 5.4 and Figure 6 caption) SNLI text data with BERT and mix-up training... batch size 32, learning rate 0.0005, noise scale 0.4, clipping norm 0.1 or 20. (Figure 8 caption) For the California Housing, we use DP-Adam with batch size 256... noise σ = 1, clipping norm 1, and learning rate 0.0002... For Wine Quality... DP-GD, noise σ = 35, clipping norm 2, and learning rate 0.03. (Appendix C.4) Across the two experiments, we set δ = 1/(1.1 × training sample size) and use a four-layer neural network with the following structure... (Appendix C.4)
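The clipping norm R that the quoted experiments vary caps each per-sample gradient before noise is added. A minimal NumPy sketch of one vanilla DP-SGD update (flat per-sample clipping as in Abadi et al.; this is not the paper's global clipping, and the function and parameter names are illustrative):

```python
import numpy as np

def dp_sgd_step(per_sample_grads, R, sigma, rng):
    """One DP-SGD update direction.

    Each per-sample gradient is rescaled so its L2 norm is at most R,
    the clipped gradients are averaged, and Gaussian noise with standard
    deviation sigma * R / B is added (B = batch size).
    """
    B = len(per_sample_grads)
    # Clip: g_i <- g_i * min(1, R / ||g_i||)
    clipped = [g * min(1.0, R / (np.linalg.norm(g) + 1e-12))
               for g in per_sample_grads]
    noise = rng.normal(0.0, sigma * R / B, size=per_sample_grads[0].shape)
    return np.mean(clipped, axis=0) + noise
```

A small R (e.g., 0.1) rescales most per-sample gradients aggressively, while a large R (e.g., 20) leaves them nearly untouched, which is the trade-off between accuracy and calibration described in the quoted findings above.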