TabDDPM: Modelling Tabular Data with Diffusion Models

Authors: Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, Artem Babenko

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, which is consistent with the advantage of diffusion models in other fields. The source code of TabDDPM is available at GitHub.
Researcher Affiliation | Collaboration | (1) HSE University, Moscow, Russia; (2) Yandex, Moscow, Russia.
Pseudocode | No | The paper uses mathematical equations and textual descriptions to explain the model and processes, but does not include structured pseudocode or algorithm blocks. (An illustrative diffusion-step sketch, not taken from the paper, is given after this table.)
Open Source Code | Yes | The source code of TabDDPM is available at GitHub.
Open Datasets | Yes | For systematic investigation of the performance of tabular generative models, we consider a diverse set of 15 real-world public datasets. These datasets have various sizes, nature, number of features, and their distributions. Most datasets were previously used for tabular model evaluation in (Zhao et al., 2021; Gorishniy et al., 2021). Appendix F: Abalone (OpenML), Adult (income estimation; Kohavi, 1996).
Dataset Splits | Yes | Table 2 ("Details on the datasets used in the evaluation") reports per-dataset splits, e.g., AB (Abalone): 2,672 train / 669 validation / 836 test rows, 7 numerical and 1 categorical feature, regression task. (A split sketch matching these counts is given after this table.)
Hardware Specification | Yes | Experiments were conducted under Ubuntu 20.04 on a machine equipped with a GeForce RTX 2080 Ti GPU and an Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz.
Software Dependencies | Yes | We used PyTorch 1.10.1, CUDA 11.3, scikit-learn 1.1.2 and imbalanced-learn 0.9.1 (for SMOTE). (A short environment check is given after this table.)
Experiment Setup | Yes | Table 1 (the list of main hyperparameters for TabDDPM): learning rate LogUniform[0.00001, 0.003]; batch size Cat{256, 4096}; diffusion timesteps Cat{100, 1000}; training iterations Cat{5000, 10000, 20000}; # MLP layers Int{2, 4, 6, 8}; MLP width of layers Int{128, 256, 512, 1024}; proportion of samples Float{0.25, 0.5, 1, 2, 4, 8}; dropout 0.0; scheduler cosine (Nichol, 2021); Gaussian diffusion loss MSE. (A sketch that samples one configuration from this space is given below.)
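
Since the paper relies on equations rather than pseudocode, the following is an illustrative sketch, not code from the paper, of a single forward Gaussian-diffusion step under the cosine noise schedule of (Nichol, 2021) that Table 1 lists as the scheduler; the names cosine_alpha_bar and q_sample are placeholders.

```python
import torch

def cosine_alpha_bar(T: int, s: float = 0.008) -> torch.Tensor:
    # Cumulative products of alphas under the cosine schedule (Nichol & Dhariwal, 2021).
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((t / T + s) / (1 + s) * torch.pi / 2) ** 2
    return (f / f[0]).clamp(min=1e-9)

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(-1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

alpha_bar = cosine_alpha_bar(T=1000).float()
x0 = torch.randn(4, 7)                 # a batch with 7 numerical features
t = torch.randint(0, 1000, (4,))       # random diffusion timesteps
x_t = q_sample(x0, t, alpha_bar)
```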
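A minimal sketch of reproducing the Abalone row of Table 2 (2,672 train / 669 validation / 836 test rows) with scikit-learn, one of the listed dependencies; the placeholder data and random_state are assumptions, since the paper does not state how the splits were drawn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder standing in for the Abalone data: 4,177 rows, 8 features
# (7 numerical + 1 categorical); the real data would come from OpenML.
X = np.random.rand(4177, 8)
y = np.random.rand(4177)

# Carve out the 836-row test set, then split the remaining 3,341 rows
# into 2,672 train / 669 validation, matching Table 2.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=836, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=669, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 2672 669 836
```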
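A small environment check against the reported dependency versions; it only assumes the packages expose their standard __version__ attributes.

```python
# Print installed versions to compare against those reported in the paper
# (PyTorch + CUDA 11.3, scikit-learn 1.1.2, imbalanced-learn 0.9.1).
import torch
import sklearn
import imblearn

print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("scikit-learn:", sklearn.__version__)
print("imbalanced-learn:", imblearn.__version__)
```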
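A hedged sketch of the Table 1 search space as a plain-Python sampler; the paper does not specify the tuning framework, so drawing a single random configuration is only meant to make the listed ranges concrete.

```python
import math
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration from the Table 1 search space."""
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(1e-5), math.log10(3e-3)),  # LogUniform[1e-5, 3e-3]
        "batch_size": rng.choice([256, 4096]),
        "diffusion_timesteps": rng.choice([100, 1000]),
        "training_iterations": rng.choice([5000, 10000, 20000]),
        "n_mlp_layers": rng.choice([2, 4, 6, 8]),
        "mlp_width": rng.choice([128, 256, 512, 1024]),
        "proportion_of_samples": rng.choice([0.25, 0.5, 1, 2, 4, 8]),
        "dropout": 0.0,                        # fixed
        "scheduler": "cosine",                 # Nichol (2021)
        "gaussian_diffusion_loss": "mse",      # fixed
    }

print(sample_config(random.Random(0)))
```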