TabDDPM: Modelling Tabular Data with Diffusion Models
Authors: Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, Artem Babenko
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, which is consistent with the advantage of diffusion models in other fields. The source code of TabDDPM is available at GitHub. |
| Researcher Affiliation | Collaboration | HSE University, Moscow, Russia; Yandex, Moscow, Russia. |
| Pseudocode | No | The paper uses mathematical equations and textual descriptions to explain the model and processes, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code of TabDDPM is available at GitHub. |
| Open Datasets | Yes | For systematic investigation of the performance of tabular generative models, we consider a diverse set of 15 real-world public datasets. These datasets have various sizes, nature, number of features, and their distributions. Most datasets were previously used for tabular model evaluation in (Zhao et al., 2021; Gorishniy et al., 2021). Appendix F: Abalone (OpenML), Adult (income estimation, (Kohavi, 1996)) |
| Dataset Splits | Yes | Table 2 (excerpt). Details on the datasets used in the evaluation: AB (Abalone) has 2672 train, 669 validation, and 836 test rows, with 7 numerical and 1 categorical features, on a regression task. |
| Hardware Specification | Yes | Experiments were conducted under Ubuntu 20.04 on a machine equipped with a GeForce RTX 2080 Ti GPU and an Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz. |
| Software Dependencies | Yes | We used PyTorch 1.10.1, CUDA 11.3, scikit-learn 1.1.2 and imbalanced-learn 0.9.1 (for SMOTE). |
| Experiment Setup | Yes | Table 1. The list of main hyperparameters for TabDDPM: Learning rate = LogUniform[0.00001, 0.003]; Batch size = Cat{256, 4096}; Diffusion timesteps = Cat{100, 1000}; Training iterations = Cat{5000, 10000, 20000}; # MLP layers = Int{2, 4, 6, 8}; MLP width of layers = Int{128, 256, 512, 1024}; Proportion of samples = Float{0.25, 0.5, 1, 2, 4, 8}; Dropout = 0.0; Scheduler = cosine (Nichol, 2021); Gaussian diffusion loss = MSE. |
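
The hyperparameter search space quoted in the Experiment Setup row can be expressed programmatically. Below is a minimal sketch assuming Optuna is used for sampling; the quoted table specifies only the search space, so the library choice, the function name `sample_tabddpm_hyperparams`, and the parameter keys are illustrative assumptions rather than the authors' actual tuning code.

```python
import optuna


def sample_tabddpm_hyperparams(trial: optuna.Trial) -> dict:
    """Sample one configuration from the search space in Table 1 (illustrative)."""
    return {
        # LogUniform[0.00001, 0.003]
        "lr": trial.suggest_float("lr", 1e-5, 3e-3, log=True),
        # Cat{256, 4096}
        "batch_size": trial.suggest_categorical("batch_size", [256, 4096]),
        # Cat{100, 1000}
        "diffusion_timesteps": trial.suggest_categorical("diffusion_timesteps", [100, 1000]),
        # Cat{5000, 10000, 20000}
        "train_iterations": trial.suggest_categorical("train_iterations", [5000, 10000, 20000]),
        # Int{2, 4, 6, 8}
        "n_mlp_layers": trial.suggest_categorical("n_mlp_layers", [2, 4, 6, 8]),
        # Int{128, 256, 512, 1024}
        "mlp_width": trial.suggest_categorical("mlp_width", [128, 256, 512, 1024]),
        # Float{0.25, 0.5, 1, 2, 4, 8}
        "sample_proportion": trial.suggest_categorical(
            "sample_proportion", [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
        ),
        # Fixed values from Table 1
        "dropout": 0.0,
        "scheduler": "cosine",  # (Nichol, 2021)
        "gaussian_diffusion_loss": "mse",
    }
```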
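
Similarly, the Abalone numbers quoted from Table 2 (2672 / 669 / 836) sum to 4,177 rows, i.e. roughly a 64/16/20 split. The following is a minimal sketch of producing splits of that shape with scikit-learn, assuming a plain random split; the seed and exact splitting procedure are not stated in the quoted excerpt.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder indices for the 4,177 Abalone rows; the real features and targets
# would come from OpenML. The random_state is an assumption, not the paper's.
indices = np.arange(4177)

train_val_idx, test_idx = train_test_split(indices, test_size=836, random_state=0)
train_idx, val_idx = train_test_split(train_val_idx, test_size=669, random_state=0)

print(len(train_idx), len(val_idx), len(test_idx))  # -> 2672 669 836
```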