CuTS: Customizable Tabular Synthetic Data Generation

Authors: Mark Vero, Mislav Balunović, Martin Vechev

ICML 2024

Reproducibility checklist. Each entry gives the variable, the assessed result, and the LLM's supporting response (excerpts from the paper):

Research Type: Experimental
LLM response: "We evaluate CuTS over four datasets and on numerous custom specifications, outperforming state-of-the-art specialized approaches on several tasks while being more general. In our experimental evaluation, we demonstrate that CuTS produces synthetic data according to a number of custom specifications unsupported by prior work, while achieving high utility."

Researcher Affiliation: Academia
LLM response: "Mark Vero (1), Mislav Balunović (1), Martin Vechev (1); (1) Department of Computer Science, ETH Zurich, Switzerland."

Pseudocode: Yes
LLM response: "Algorithm 1: CuTS Privacy Budget Annealing"

Open Source Code: Yes
LLM response: "We provide an implementation of CuTS under: https://github.com/eth-sri/cuts/."

Open Datasets: Yes
LLM response: "We evaluate our method on four popular tabular datasets: Adult (Dua & Graff, 2017), German Credit (Dua & Graff, 2017), Compas (Angwin et al., 2016), and the Health Heritage Prize dataset from Kaggle (Kaggle, 2023)."

Dataset Splits: Yes
LLM response: "The regularization parameters are selected on a hold-out validation dataset. For choosing the constraint weights $\{\lambda_i\}_{i=1}^n$, we implemented a k-fold cross-validation scheme splitting over the reference dataset of the fine-tuning objective. The UCI Adult Census dataset (Dua & Graff, 2017) contains US-census data of 45 222 individuals (excluding incomplete rows), split into training and test sets of size 30 162 and 15 060, respectively."

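The paper gives no further detail on the k-fold scheme beyond the excerpt above, so the following is a minimal, hedged sketch of how constraint weights could be selected by k-fold cross-validation over a reference dataset. Here `fine_tune` and `score` are hypothetical stand-ins for the CuTS fine-tuning step and the validation metric; neither name comes from the paper.

```python
# Hedged sketch of k-fold selection of constraint weights {lambda_i}.
# `fine_tune` and `score` are hypothetical placeholders, not CuTS APIs.
import numpy as np
from sklearn.model_selection import KFold

def select_constraint_weights(reference_data, candidate_weights, k=5):
    """Return the candidate weight vector with the best mean validation score."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    best_weights, best_score = None, -np.inf
    for weights in candidate_weights:
        fold_scores = []
        for train_idx, val_idx in kf.split(reference_data):
            model = fine_tune(reference_data[train_idx], weights)   # hypothetical
            fold_scores.append(score(model, reference_data[val_idx]))  # hypothetical
        mean_score = float(np.mean(fold_scores))
        if mean_score > best_score:
            best_weights, best_score = weights, mean_score
    return best_weights
```
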
Hardware Specification: Yes
LLM response: "For running the experiments we had 7 NVIDIA GeForce RTX 2080 Ti GPUs, 4 NVIDIA TITAN RTX, 2 NVIDIA GeForce GTX 1080 Ti, and 2 NVIDIA A100 SXM 40GB Tensor Core GPUs available, where the A100 cards were used only for the experiments on Health Heritage."

Software Dependencies: No
LLM response: "For measuring the utility of the dataset using the XGB accuracy metric, we use an XGBoost classifier with the default hyperparameters, as included in the XGBoost Python library. We also use the Adam optimizer, with the default parameters." No specific version numbers for these software components are provided.

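Given only "default hyperparameters" in the excerpt above, the XGB accuracy metric can be reproduced along these lines. The train-on-synthetic, test-on-real protocol shown here is a common convention for synthetic-data utility and is an assumption, not a quote from the paper.

```python
# Sketch of the XGB accuracy metric: an XGBoost classifier with default
# hyperparameters. Train-on-synthetic / test-on-real is an assumed protocol.
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

def xgb_utility(synthetic_X, synthetic_y, real_test_X, real_test_y):
    clf = XGBClassifier()  # default hyperparameters, per the paper
    clf.fit(synthetic_X, synthetic_y)
    return accuracy_score(real_test_y, clf.predict(real_test_X))
```
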
Experiment Setup: Yes
LLM response: "In each of our experiments the base architecture of the CuTS generative model gθ is formed by a four-layer fully connected neural network with residual connections, where the first hidden layer contains 100 neurons, and the rest of the layers 200. The input dimension of the network, i.e., the dimension of the sampled Gaussian noise z, is 100. For pre-training the non-private model, we use batch size 15 000 (i.e., the generated dataset we measure the marginals of has 15 000 rows), and train the model for 2 000 epochs."
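As a reading aid, here is a hedged PyTorch sketch of the generator described in that excerpt: four fully connected layers with residual connections, a 100-neuron first hidden layer, 200 neurons in the remaining layers, and 100-dimensional Gaussian noise as input. The ReLU activation, the output width `out_dim`, and the exact placement of the skip connections are assumptions not fixed by the excerpt.

```python
# Hedged sketch of the CuTS generator g_theta as described in the paper.
# Activation, out_dim, and skip-connection placement are assumptions.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Four fully connected layers with residual connections."""
    def __init__(self, noise_dim=100, out_dim=128):  # out_dim: assumed placeholder
        super().__init__()
        self.fc1 = nn.Linear(noise_dim, 100)  # first hidden layer: 100 neurons
        self.fc2 = nn.Linear(100, 200)        # remaining layers: 200 neurons
        self.fc3 = nn.Linear(200, 200)
        self.fc4 = nn.Linear(200, out_dim)
        self.act = nn.ReLU()

    def forward(self, z):
        h1 = self.act(self.fc1(z)) + z    # residual: input and first hidden layer are both 100-d
        h2 = self.act(self.fc2(h1))
        h3 = self.act(self.fc3(h2)) + h2  # residual across the equal-width 200-d layers
        return self.fc4(h3)

# Pre-training (per the excerpt): batches of 15 000 generated rows, 2 000 epochs.
z = torch.randn(15_000, 100)
synthetic_rows = Generator()(z)
```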