ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models

Authors: Wei Pang, Masoumeh Shafieinejad, Lucy Liu, Stephanie Hazlewood, Xi He

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.
Researcher Affiliation | Collaboration | Wei Pang (1,2), Masoumeh Shafieinejad (2), Lucy Liu (3), Stephanie Hazlewood (3), and Xi He (1,2). (1) University of Waterloo, (2) Vector Institute, (3) Royal Bank of Canada.
Pseudocode | Yes | Algorithm 1 ClavaDDPM: Latent learning and table augmentation. Algorithm 2 ClavaDDPM: Training. Algorithm 3 ClavaDDPM: Synthesis.
Open Source Code | Yes | We upload supplementary materials including code for reproducibility.
Open Datasets | Yes | We experiment with five real-world multi-relational datasets including California [6], Instacart 05 [23], Berka [4], MovieLens [39, 32], and CCS [32].
Dataset Splits | No | The paper mentions a train-test split for MLE evaluation but does not specify a separate validation split for model training or tuning across all experiments.
Hardware Specification | Yes | All experiments are conducted with an NVIDIA A6000 GPU and 32 CPU cores, with a time limit of 7 days.
Software Dependencies | No | The paper mentions components and tools such as MLPs, the AdamW optimizer, CTGAN, TabDDPM, and the SDV library, but does not provide version numbers for any of them.
Experiment Setup | Yes | We perform a comprehensive ablation study using Berka (for it has the most complex multi-table structure) on each component of ClavaDDPM and provide empirical tuning suggestions. The full results are in Table 2.