Modeling Tabular data using Conditional GAN

Authors: Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate these GANs, we used a group of real datasets to set up a benchmarking system and implemented three of the most recent techniques. For comparison purposes, we created two baseline methods using Bayesian networks. After testing these models using both simulated and real datasets, we found that modeling tabular data poses unique challenges for GANs... When applied to the same datasets with the benchmarking suite, CTGAN performs significantly better than both the Bayesian network baselines and the other GANs tested, as shown in Table 1. We evaluated CLBN, PrivBN, MedGAN, VeeGAN, TableGAN, CTGAN, and TVAE using our benchmark framework."
Researcher Affiliation | Academia | Lei Xu, MIT LIDS, Cambridge, MA, leix@mit.edu; Maria Skoularidou, MRC-BSU, University of Cambridge, Cambridge, UK, ms2407@cam.ac.uk; Alfredo Cuesta-Infante, Universidad Rey Juan Carlos, Móstoles, Spain, alfredo.cuesta@urjc.es; Kalyan Veeramachaneni, MIT LIDS, Cambridge, MA, kalyanv@mit.edu
Pseudocode | No | The paper describes network structures using mathematical formulas (e.g., h_0 = z ⊕ cond; h_1 = h_0 ⊕ ReLU(BN(FC_{|cond|+|z|→256}(h_0)))), but does not provide structured pseudocode or algorithm blocks.
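As an illustration of what that formula encodes (a concatenation-based residual block), the following PyTorch sketch may help; it is not the authors' implementation, and the z/cond sizes are made up:

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """One generator block matching the quoted formula:
    h_{i+1} = h_i (+) ReLU(BN(FC(h_i))), where (+) is concatenation."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)
        self.relu = nn.ReLU()

    def forward(self, h):
        out = self.relu(self.bn(self.fc(h)))
        return torch.cat([h, out], dim=1)  # widen h by out_dim each block

# h0 is the concatenation of noise z and the condition vector cond.
z = torch.randn(500, 128)            # illustrative sizes, not from the paper
cond = torch.zeros(500, 10)
h0 = torch.cat([z, cond], dim=1)
h1 = Residual(h0.size(1), 256)(h0)   # shape: (500, 138 + 256)
```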
Open Source Code | Yes | "Our CTGAN model is open-sourced at https://github.com/DAI-Lab/CTGAN"
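For reference, a minimal usage sketch of the open-sourced package looks like the following; the CTGAN class with fit/sample exists in recent ctgan releases (older versions expose CTGANSynthesizer instead), and the dataset path and discrete-column names here are hypothetical:

```python
import pandas as pd
from ctgan import CTGAN

data = pd.read_csv("adult.csv")                           # one of the UCI datasets above
discrete_columns = ["workclass", "education", "income"]   # hypothetical subset

model = CTGAN(epochs=300, batch_size=500)   # training settings reported in the paper
model.fit(data, discrete_columns)
synthetic = model.sample(1000)              # draw 1000 synthetic rows
```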
Open Datasets | Yes | "Real datasets: We picked 6 commonly used machine learning datasets from the UCI machine learning repository [9], with features and label columns in tabular form: adult, census, covertype, intrusion, and news. We picked credit from Kaggle. We also binarized the 28 × 28 MNIST [16] dataset and converted each sample to a 784-dimensional feature vector plus one label column to mimic high-dimensional binary data, called MNIST28. We resized the images to 12 × 12 and used the same process to generate a dataset we call MNIST12."
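A sketch of that MNIST-to-table conversion, under stated assumptions (the paper does not give a binarization threshold, so 127 here is an assumption), could look like this:

```python
import numpy as np
from PIL import Image

def mnist_to_table(images, labels, size=28, threshold=127):
    """Flatten binarized MNIST digits into size*size binary feature columns
    plus one label column (MNIST28 for size=28, MNIST12 for size=12).
    The binarization threshold is an assumption, not from the paper."""
    rows = []
    for img, lab in zip(images, labels):
        if size != 28:
            img = np.asarray(Image.fromarray(img).resize((size, size)))
        feats = (img > threshold).astype(np.uint8).ravel()
        rows.append(np.append(feats, lab))
    return np.stack(rows)  # shape: (n_samples, size*size + 1)

# Example with random stand-in images:
imgs = np.random.randint(0, 256, size=(5, 28, 28), dtype=np.uint8)
table = mnist_to_table(imgs, labels=[0, 1, 2, 3, 4], size=12)
print(table.shape)  # (5, 145)
```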
Dataset Splits | Yes | "T is partitioned into training set T_train and test set T_test. We train prediction models on T_syn and test prediction models using T_test."
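That protocol (train on synthetic, test on held-out real data) can be sketched as follows; the split ratio, classifier choice, and stand-in data are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))            # stand-in for a real table T
y = rng.integers(0, 2, size=1000)

# Partition T into T_train and T_test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# In the benchmark, a generative model fit on T_train would produce T_syn;
# here T_train is reused as a stand-in so the sketch runs end to end.
X_syn, y_syn = X_train, y_train

# Train the prediction model on T_syn, evaluate on T_test.
clf = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```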
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions training with the Adam optimizer and a WGAN loss with gradient penalty, but does not list specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8).
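The WGAN gradient penalty mentioned above is the standard term from Gulrajani et al. (2017); a minimal PyTorch sketch of it, not tied to the authors' code, is:

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP term: push the critic's gradient norm toward 1 at points
    interpolated between real and fake rows."""
    alpha = torch.rand(real.size(0), 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interp)
    grads, = torch.autograd.grad(scores, interp,
                                 grad_outputs=torch.ones_like(scores),
                                 create_graph=True)
    return lambda_gp * ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Quick check with a stand-in critic:
critic = nn.Sequential(nn.Linear(16, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
gp = gradient_penalty(critic, torch.randn(8, 16), torch.randn(8, 16))
```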
Experiment Setup | Yes | "We trained each model with a batch size of 500. Each model is trained for 300 epochs. Each epoch contains N/batch_size steps where N is the number of rows in the training set. We use Adam optimizer with learning rate 2 × 10^-4. TVAE is trained using Adam with learning rate 1 × 10^-3."
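Translated into code, that setup looks roughly like the sketch below; the stand-in modules, the value of N, and the Adam betas are assumptions (the quote states only the learning rates, batch size, and epoch count):

```python
import torch
import torch.nn as nn

generator = nn.Linear(128, 256)   # stand-in module; the real networks are deeper
tvae = nn.Linear(128, 256)

N, batch_size, epochs = 32561, 500, 300   # N is illustrative (an adult-sized table)
steps_per_epoch = N // batch_size          # "N/batch_size steps" per epoch

# Learning rates as quoted; the betas are an assumption taken from common
# WGAN-GP practice, not from the paper.
gan_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.9))
tvae_opt = torch.optim.Adam(tvae.parameters(), lr=1e-3)
```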