Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Elusive Pursuit of Reproducing PATE-GAN: Benchmarking, Auditing, Debugging

Authors: Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Emiliano De Cristofaro

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we set out to reproduce the utility evaluation from the original PATE-GAN paper, compare available implementations, and conduct a privacy audit. More precisely, we analyze and benchmark six open-source PATE-GAN implementations, including three by (a subset of) the original authors. First, we shed light on architecture deviations and empirically demonstrate that none reproduce the utility performance reported in the original paper. We then present an in-depth privacy evaluation, which includes DP auditing, and show that all implementations leak more privacy than intended.
Researcher Affiliation Collaboration Georgi Ganev (University College London and SAS); Meenatchi Sundaram Muthu Selva Annamalai (University College London); Emiliano De Cristofaro (University of California, Riverside)
Pseudocode Yes We report its pseudo-code in Algorithm 1 in Appendix A.1. As before, D is first separated into k disjoint partitions. In each iteration, k teacher-discriminators, T1, ..., Tk, are trained on their corresponding data partitions. Instead of using a public dataset, in PATE-GAN, the generator G generates samples that are labeled by the teachers as real or fake through the PATE mechanism. The student-discriminator S is then trained on these generated samples and noisy (DP) labels. Finally, the generator is trained by minimizing the loss on the student.
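The noisy labeling step can be sketched as follows. This is an illustrative toy, not code from any of the audited implementations; the function name `pate_label` and the inverse noise scale `lam` are our assumptions, standing in for the Laplace noisy-max aggregation PATE uses:

```python
import numpy as np

def pate_label(teacher_votes, lam=10.0, rng=None):
    """Aggregate binary teacher votes (1 = real, 0 = fake) into a single
    noisy (DP) label via Laplace noisy-max aggregation, as in PATE.
    `lam` is an illustrative inverse noise scale, not a value from the paper."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Per-class vote counts over the k teacher-discriminators
    counts = np.array([(teacher_votes == 0).sum(), (teacher_votes == 1).sum()])
    # Add independent Laplace noise to each count, then take the argmax
    noisy_counts = counts + rng.laplace(scale=1.0 / lam, size=2)
    return int(np.argmax(noisy_counts))
```

In the loop described above, the student-discriminator S would then be trained on generator samples paired with these noisy labels.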
Open Source Code Yes Lastly, our codebase is available from: https://github.com/spalabucr/pategan-audit. To facilitate open research and robust privacy (re-)implementations, we release our codebase, including the utility benchmark and privacy auditing tools; see https://github.com/spalabucr/pategan-audit.
Open Datasets Yes In our experiments, we use all four publicly available tabular datasets (two Kaggle and two UCI) used in the original evaluation, a common image dataset (MNIST), and a worst-case dataset we create as part of the DP auditing tests. We use four of the original six datasets from (Jordon et al., 2019), as the other two are not publicly available; specifically, Kaggle Credit (Pozzolo et al., 2015), Kaggle Cervical Cancer (Fernandes et al., 2017), UCI ISOLET (Cole & Fanty, 1994), and UCI Epileptic Seizure (Andrzejak et al., 2001). We also use MNIST (Le Cun et al., 2010), the popular digits dataset.
Dataset Splits Yes We use an 80/20 split, i.e., using 80% of the records in the datasets to train the predictive/generative models and 20% for testing.
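A minimal sketch of such an 80/20 split using scikit-learn, which the paper already relies on; the toy `records` array and the `random_state` value are our assumptions, not taken from the paper:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for a tabular dataset: 100 records, 10 features each
records = np.arange(1000).reshape(100, 10)

# 80% of records to train the predictive/generative models, 20% for testing
train, test = train_test_split(records, test_size=0.2, random_state=0)
```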
Hardware Specification Yes Finally, all experiments are run on an AWS instance (m4.4xlarge) with a 2.4GHz Intel Xeon E5-2676 v3 (Haswell) processor, 16 vCPUs, and 64GB RAM.
Software Dependencies No The paper mentions software libraries like "scikit-learn (Pedregosa et al., 2011)" and "xgboost (Chen & Guestrin, 2016)" but does not provide specific version numbers for these libraries or the programming language used.
Experiment Setup Yes We set δ = 10^-5 (as in (Jordon et al., 2019)) and use the implementations' default hyperparameters, with a couple of exceptions. First, we set the maximum number of training iterations to 10,000 to reduce computation. In our experiments, this limit is only reached for borealis and smartnoise with ε ≤ 10. Consequently, we train synthcity for a fixed number of iterations rather than epochs. Second, for updated, we use λ = 0.001 to prevent the model from spending its privacy budget in just a few iterations. Finally, for all models, we set the number of teachers to N/1,000 following (Jordon et al., 2019), with the only exception being Kaggle Credit, where we set it to N/5,000 due to computational constraints; regardless, note that the difference in performance in the original paper (Jordon et al., 2019) is negligible. Per-implementation defaults: Optimizer: Adam / Adam / RMSProp / Adam / Adam / Adam / Adam. Learning Rate: 1e-4 for all. Batch Size: 64 / 128 / 64 / 200 / 128 / 64 / 64. Max Iterations: 10,000 / 1,000 or 100 / -.
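The teacher-count rule above can be written as a small helper. This is an illustrative sketch only; the function name and the `kaggle_credit` flag are ours, not the paper's:

```python
def num_teachers(n_records, kaggle_credit=False):
    """Number of teachers k = N/1,000 following Jordon et al. (2019),
    except N/5,000 for Kaggle Credit (the computational-cost exception
    noted above). Floors at 1 teacher for very small N (our assumption)."""
    divisor = 5_000 if kaggle_credit else 1_000
    return max(1, n_records // divisor)
```

For example, a 50,000-record dataset would get 50 teachers under the default rule, but only 10 under the Kaggle Credit exception.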