Modeling Tabular data using Conditional GAN
Authors: Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate these GANs, we used a group of real datasets to set up a benchmarking system and implemented three of the most recent techniques. For comparison purposes, we created two baseline methods using Bayesian networks. After testing these models using both simulated and real datasets, we found that modeling tabular data poses unique challenges for GANs... When applied to the same datasets with the benchmarking suite, CTGAN performs significantly better than both the Bayesian network baselines and the other GANs tested, as shown in Table 1. We evaluated CLBN, PrivBN, MedGAN, VeeGAN, TableGAN, CTGAN, and TVAE using our benchmark framework. |
| Researcher Affiliation | Academia | Lei Xu, MIT LIDS, Cambridge, MA, leix@mit.edu; Maria Skoularidou, MRC-BSU, University of Cambridge, Cambridge, UK, ms2407@cam.ac.uk; Alfredo Cuesta-Infante, Universidad Rey Juan Carlos, Móstoles, Spain, alfredo.cuesta@urjc.es; Kalyan Veeramachaneni, MIT LIDS, Cambridge, MA, kalyanv@mit.edu |
| Pseudocode | No | The paper describes network structures using mathematical formulas (e.g., the generator layers h₀ = z ⊕ cond; h₁ = h₀ ⊕ ReLU(BN(FC_{|cond|+|z|→256}(h₀)))), but does not provide structured pseudocode or algorithm blocks. (A hedged PyTorch sketch of this layer structure follows the table.) |
| Open Source Code | Yes | Our CTGAN model is open-sourced at https://github.com/DAI-Lab/CTGAN (a usage sketch follows the table). |
| Open Datasets | Yes | Real datasets: We picked 6 commonly used machine learning datasets from the UCI machine learning repository [9], with features and label columns in tabular form: adult, census, covertype, intrusion, and news. We picked credit from Kaggle. We also binarized the 28 × 28 MNIST [16] dataset and converted each sample to a 784-dimensional feature vector plus one label column to mimic high-dimensional binary data, called MNIST28. We resized the images to 12 × 12 and used the same process to generate a dataset we call MNIST12. (A sketch of the MNIST28 construction follows the table.) |
| Dataset Splits | Yes | T is partitioned into a training set T_train and a test set T_test. We train prediction models on T_syn and test prediction models using T_test. (A sketch of this train-on-synthetic, test-on-real protocol follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions training with the 'Adam optimizer' and a 'WGAN loss with gradient penalty', but does not list specific software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8'). (A generic gradient-penalty sketch follows the table.) |
| Experiment Setup | Yes | We trained each model with a batch size of 500. Each model is trained for 300 epochs. Each epoch contains N/batch_size steps, where N is the number of rows in the training set. We use the Adam optimizer with learning rate 2 × 10⁻⁴. TVAE is trained using Adam with learning rate 1 × 10⁻³. (A skeleton wiring these settings together follows the table.) |
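The generator formulas quoted in the Pseudocode row translate almost directly into code: each block concatenates (rather than adds) a ReLU(BN(FC(·))) output onto its input. Below is a minimal PyTorch sketch of that structure. The hidden width of 256 comes from the quoted formula; the latent and conditional dimensions are placeholders, and the paper's per-column tanh/gumbel-softmax output activations are omitted.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """One generator block: h_{i+1} = h_i ⊕ ReLU(BN(FC(h_i)))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)
        self.relu = nn.ReLU()

    def forward(self, h):
        out = self.relu(self.bn(self.fc(h)))
        return torch.cat([h, out], dim=1)  # concatenation, not residual addition

class Generator(nn.Module):
    def __init__(self, z_dim, cond_dim, data_dim, hidden=256):
        super().__init__()
        in_dim = z_dim + cond_dim                       # h0 = z ⊕ cond
        self.block1 = Residual(in_dim, hidden)
        self.block2 = Residual(in_dim + hidden, hidden)
        self.out = nn.Linear(in_dim + 2 * hidden, data_dim)

    def forward(self, z, cond):
        h = torch.cat([z, cond], dim=1)
        h = self.block2(self.block1(h))
        return self.out(h)  # per-column activations applied downstream
```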
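Since the code is open-sourced, the model can be exercised without reimplementation. The sketch below assumes the ctgan package published from that repository (current releases expose a CTGAN class; older ones named it CTGANSynthesizer); the file name and column list are hypothetical.

```python
# pip install ctgan
import pandas as pd
from ctgan import CTGAN  # CTGANSynthesizer in older releases

real = pd.read_csv("adult.csv")                          # hypothetical local copy of the adult table
discrete_columns = ["workclass", "education", "income"]  # illustrative subset of categorical columns

model = CTGAN(epochs=300, batch_size=500)  # settings reported in the paper
model.fit(real, discrete_columns)
synthetic = model.sample(len(real))
```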
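The MNIST28 construction quoted in the Open Datasets row is mechanical. A sketch assuming scikit-learn's OpenML mirror of MNIST and a 0.5 binarization threshold (the paper does not state the threshold):

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Fetch 28x28 MNIST as flat 784-dimensional rows.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)

# Binarize pixels and append the label column -> an MNIST28-style table.
X = (mnist.data / 255.0 > 0.5).astype(np.int8)  # 70000 x 784 binary features
mnist28 = np.column_stack([X, mnist.target])
# MNIST12 would first resize each image to 12x12, then apply the same process.
```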
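The split protocol in the Dataset Splits row is the paper's machine-learning-efficacy test: fit the synthesizer on T_train, sample T_syn, train a predictor on T_syn, and score it on the held-out real rows T_test. A minimal sketch continuing from the usage example above; the 70/30 split ratio, the decision-tree classifier, and the '>50K' label value are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

train, test = train_test_split(real, test_size=0.3, random_state=0)  # T_train / T_test

model.fit(train, discrete_columns)  # fit the synthesizer on T_train only
syn = model.sample(len(train))      # T_syn

# Train on synthetic rows, evaluate on held-out real rows.
X_syn = pd.get_dummies(syn.drop(columns="income"))
X_test = pd.get_dummies(test.drop(columns="income")).reindex(columns=X_syn.columns, fill_value=0)
clf = DecisionTreeClassifier().fit(X_syn, syn["income"])
print(f1_score(test["income"], clf.predict(X_test), pos_label=">50K"))  # label value assumed
```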
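The 'WGAN loss with gradient penalty' named in the Software Dependencies row is the standard WGAN-GP objective (Gulrajani et al., 2017). A generic PyTorch sketch of the penalty term follows; note that the official CTGAN implementation computes it over packs of samples (PacGAN), which this simplified version omits.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """Penalize the critic's gradient norm at points interpolated
    between real and fake rows, pushing it toward 1."""
    alpha = torch.rand(real.size(0), 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    score = critic(interp)
    grad, = torch.autograd.grad(
        outputs=score, inputs=interp,
        grad_outputs=torch.ones_like(score),
        create_graph=True, retain_graph=True,
    )
    return lambda_gp * ((grad.norm(2, dim=1) - 1) ** 2).mean()
```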
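Finally, the Experiment Setup row pins down the batch size, epoch count, steps per epoch, and learning rates. The skeleton below wires those reported numbers together around the Generator sketch above; the data dimensions, the critic architecture, optimizer betas, and the inner update steps are all assumptions left unspecified by the quoted text.

```python
import torch

batch_size, epochs = 500, 300       # reported settings
lr_gan, lr_tvae = 2e-4, 1e-3        # reported learning rates (GAN models vs. TVAE)
N = len(train)                      # rows in the training set
steps_per_epoch = N // batch_size   # "N / batch_size steps" per epoch

generator = Generator(z_dim=128, cond_dim=10, data_dim=50)  # dims are placeholders
critic = torch.nn.Sequential(                               # illustrative critic only
    torch.nn.Linear(50 + 10, 256), torch.nn.LeakyReLU(0.2), torch.nn.Linear(256, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=lr_gan)
opt_d = torch.optim.Adam(critic.parameters(), lr=lr_gan)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        pass  # critic update with WGAN-GP loss, then generator update (omitted)
```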