Debiasing Synthetic Data Generated by Deep Generative Models

Authors: Alexander Decruyenaere, Heidelinde Dehaene, Paloma Rabaey, Johan Decruyenaere, Christiaan Polet, Thomas Demeester, Stijn Vansteelandt

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We exemplify our proposal through a simulation study on toy data and two case studies on real-world data, highlighting the importance of tailoring DGMs for targeted data analysis.
Researcher Affiliation | Collaboration | Alexander Decruyenaere (Ghent University Hospital, SYNDARA); Heidelinde Dehaene (Ghent University Hospital, SYNDARA); Paloma Rabaey (Ghent University, imec); Christiaan Polet (Ghent University Hospital, SYNDARA); Johan Decruyenaere (Ghent University Hospital, SYNDARA); Thomas Demeester (Ghent University, imec); Stijn Vansteelandt (Ghent University, Department of Applied Mathematics, Computer Science and Statistics)
Pseudocode | Yes | Algorithm 1: Data generating process for hypothetical disease.
Open Source Code | Yes | Our code is available on GitHub: https://github.com/syndara-lab/debiased-generation.
Open Datasets | Yes | International Stroke Trial (IST) dataset (Sandercock et al., 2011).
Dataset Splits | No | The paper describes generating synthetic data and evaluating its utility, but it does not specify a separate validation split of the original data in the conventional machine learning sense. Methods such as k-fold cross-fitting are mentioned, but they are not defined as a validation split used for hyperparameter tuning in the main experimental setup.
Hardware Specification | Yes | All experiments were run on our institutional high performance computing cluster using a single GPU (NVIDIA Ampere A100; 80GB GPU memory) and single CPU (AMD EPYC 7413).
Software Dependencies | No | The paper mentions the software packages 'Synthcity' and 'SDV', and implicitly 'Python', but does not provide version numbers for these tools or for any other software libraries required for reproducibility.
Experiment Setup | Yes | The DGMs were trained using the default hyperparameters as suggested by the package Synthcity (Qian et al., 2023). We also show results obtained for other hyperparameters (the default in the package SDV (Patki et al., 2016)) in Appendix A.7.4. A comparison of the default hyperparameters in both packages is provided in Tables A1 and A2.
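
The Software Dependencies entry above notes that no version numbers are reported. As an illustration only, and not something taken from the paper or its repository, the following minimal sketch shows one way such versions could be recorded from a working environment using the Python standard library. The distribution names `synthcity` and `sdv` are assumed to match the PyPI names of the packages mentioned above, and the helper names `installed_versions` and `format_pins` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
import platform


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


def format_pins(versions):
    """Render requirements-style lines; missing packages become comments."""
    lines = []
    for name in sorted(versions):
        ver = versions[name]
        if ver is None:
            lines.append(f"# {name}: not installed")
        else:
            lines.append(f"{name}=={ver}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumed distribution names; adjust to the environment actually used.
    pins = installed_versions(["synthcity", "sdv"])
    print(f"# python {platform.python_version()}")
    print(format_pins(pins))
```

Run inside the environment used for the experiments, this prints the Python version followed by one pinned line per installed package, which could then be stored alongside the code.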