TabDDPM: Modelling Tabular Data with Diffusion Models

Authors: Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, Artem Babenko

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, which is consistent with the advantage of diffusion models in other fields. The source code of TabDDPM is available at GitHub.
Researcher Affiliation | Collaboration | (1) HSE University, Moscow, Russia; (2) Yandex, Moscow, Russia.
Pseudocode | No | The paper uses mathematical equations and textual descriptions to explain the model and processes, but does not include structured pseudocode or algorithm blocks. (An illustrative diffusion-step sketch, not taken from the paper, is given after this table.)
Open Source Code | Yes | The source code of TabDDPM is available at GitHub.
Open Datasets | Yes | For systematic investigation of the performance of tabular generative models, we consider a diverse set of 15 real-world public datasets. These datasets have various sizes, nature, number of features, and their distributions. Most datasets were previously used for tabular model evaluation in (Zhao et al., 2021; Gorishniy et al., 2021). Appendix F: Abalone (OpenML), Adult (income estimation; Kohavi, 1996).
Dataset Splits | Yes | Table 2 ("Details on the datasets used in the evaluation") reports per-dataset splits, e.g., AB (Abalone): 2,672 train / 669 validation / 836 test rows, 7 numerical and 1 categorical feature, regression task. (A split sketch matching these counts is given after this table.)
Hardware Specification | Yes | Experiments were conducted under Ubuntu 20.04 on a machine equipped with a GeForce RTX 2080 Ti GPU and an Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz.
Software Dependencies | Yes | We used PyTorch 1.10.1, CUDA 11.3, scikit-learn 1.1.2 and imbalanced-learn 0.9.1 (for SMOTE). (A short environment check is given after this table.)
Experiment Setup | Yes | Table 1 (the list of main hyperparameters for TabDDPM): learning rate LogUniform[0.00001, 0.003]; batch size Cat{256, 4096}; diffusion timesteps Cat{100, 1000}; training iterations Cat{5000, 10000, 20000}; # MLP layers Int{2, 4, 6, 8}; MLP width of layers Int{128, 256, 512, 1024}; proportion of samples Float{0.25, 0.5, 1, 2, 4, 8}; dropout 0.0; scheduler cosine (Nichol, 2021); Gaussian diffusion loss MSE. (A sketch that samples one configuration from this space is given below.)
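
Since the paper relies on equations rather than pseudocode, the following is an illustrative sketch, not code from the paper, of a single forward Gaussian-diffusion step under the cosine noise schedule of (Nichol, 2021) that Table 1 lists as the scheduler; the names cosine_alpha_bar and q_sample are placeholders.

```python
import torch

def cosine_alpha_bar(T: int, s: float = 0.008) -> torch.Tensor:
    # Cumulative products of alphas under the cosine schedule (Nichol & Dhariwal, 2021).
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((t / T + s) / (1 + s) * torch.pi / 2) ** 2
    return (f / f[0]).clamp(min=1e-9)

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(-1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

alpha_bar = cosine_alpha_bar(T=1000).float()
x0 = torch.randn(4, 7)                 # a batch with 7 numerical features
t = torch.randint(0, 1000, (4,))       # random diffusion timesteps
x_t = q_sample(x0, t, alpha_bar)
```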
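A minimal sketch of reproducing the Abalone row of Table 2 (2,672 train / 669 validation / 836 test rows) with scikit-learn, one of the listed dependencies; the placeholder data and random_state are assumptions, since the paper does not state how the splits were drawn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder standing in for the Abalone data: 4,177 rows, 8 features
# (7 numerical + 1 categorical); the real data would come from OpenML.
X = np.random.rand(4177, 8)
y = np.random.rand(4177)

# Carve out the 836-row test set, then split the remaining 3,341 rows
# into 2,672 train / 669 validation, matching Table 2.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=836, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=669, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 2672 669 836
```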
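A small environment check against the reported dependency versions; it only assumes the packages expose their standard __version__ attributes.

```python
# Print installed versions to compare against those reported in the paper
# (PyTorch + CUDA 11.3, scikit-learn 1.1.2, imbalanced-learn 0.9.1).
import torch
import sklearn
import imblearn

print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("scikit-learn:", sklearn.__version__)
print("imbalanced-learn:", imblearn.__version__)
```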
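A hedged sketch of the Table 1 search space as a plain-Python sampler; the paper does not specify the tuning framework, so drawing a single random configuration is only meant to make the listed ranges concrete.

```python
import math
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration from the Table 1 search space."""
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(1e-5), math.log10(3e-3)),  # LogUniform[1e-5, 3e-3]
        "batch_size": rng.choice([256, 4096]),
        "diffusion_timesteps": rng.choice([100, 1000]),
        "training_iterations": rng.choice([5000, 10000, 20000]),
        "n_mlp_layers": rng.choice([2, 4, 6, 8]),
        "mlp_width": rng.choice([128, 256, 512, 1024]),
        "proportion_of_samples": rng.choice([0.25, 0.5, 1, 2, 4, 8]),
        "dropout": 0.0,                        # fixed
        "scheduler": "cosine",                 # Nichol (2021)
        "gaussian_diffusion_loss": "mse",      # fixed
    }

print(sample_config(random.Random(0)))
```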