TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models

Authors: Andrei Margeloiu, Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods. When used for data augmentation, our synthetic data consistently leads to improved classification performance across diverse datasets of various sizes, especially small ones."
Researcher Affiliation | Academia | Department of Computer Science and Technology, University of Cambridge, UK; PBCI, Department of Oncology, University of Cambridge, UK
Pseudocode | Yes | Algorithm 1: TabEBM sampling from the class-specific EBM E_c(x)
Open Source Code | Yes | "Code is available at https://github.com/andreimargeloiu/TabEBM."
Open Datasets | Yes | "We utilise eight open-source tabular datasets from OpenML [7] across five domains: Medicine, Chemistry, Engineering, Language and Economics. As TabPFN utilises many small-size OpenML datasets in its meta-validation [33], it can lead to data leakage when evaluating TabEBM. Therefore, to provide fair comparisons, we select six additional leakage-free datasets from UCI [22]." (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | "We split each subset into stratified training and validation sets with a ratio of 4:1. We provide detailed descriptions of data splitting in Appendix B.2 and preprocessing in Appendix B.3." (A splitting sketch follows the table.)
Hardware Specification | Yes | "All our experiments are run on a single machine from an internal cluster with an Nvidia Quadro RTX 8000 GPU (48GB memory) and an Intel(R) Xeon(R) Gold 5218 CPU with 16 cores at 2.30GHz."
Software Dependencies | Yes | "(i) For data generators: We implemented TabEBM using PyTorch 1.13 [64], an open-source deep learning library with a BSD licence. We implemented SMOTE with Imbalanced-learn [45], an open-source Python library for imbalanced datasets with an MIT licence. For other benchmark generators, we used their open-source implementations in Synthcity [67], a library for generating and evaluating synthetic tabular data with an Apache-2.0 licence. (ii) For downstream predictors: We implemented TabPFN with its open-source implementation (https://github.com/automl/TabPFN). We implemented the other five downstream predictors (i.e., Logistic Regression, KNN, MLP, Random Forest and XGBoost) with their open-source implementations in scikit-learn [65], an open-source Python library under the 3-Clause BSD licence. (iii) For result analysis and visualisation: All numerical plots and graphics have been generated using Matplotlib 3.7 [34], a Python-based plotting library with a BSD licence." (An augmentation-and-evaluation sketch follows the table.)
Experiment Setup | Yes | "In all our experiments, the surrogate binary classifier in TabEBM is a pretrained in-context model, TabPFN [33], using the official model weights released by the authors (https://github.com/automl/TabPFN/raw/main/tabpfn/models_diff/prior_diff_real_checkpoint_n_0_epoch_42.cpkt). We use TabPFN with three ensembles. We use four surrogate negative samples, X_c^neg, positioned at α_dist^neg = 5 standard deviations from zero, in random corners of a hypercube in R^D (as explained in Section 2.2), distant from any real data. In Appendix D.1, we show that TabEBM is robust to the distribution of the negative samples. We use SGLD [84] for sampling from TabEBM, where the starting points x_0^synth are initialised by adding Gaussian noise with zero mean and standard deviation σ_start = 0.01 to a randomly selected sample of the specific class, i.e., x_0^synth ~ N(X_c, σ_start² I). For SGLD, we used the following parameters: step size α_step = 0.1, noise scale α_noise = 0.01 and number of steps T = 200." (A sampling sketch follows the table.)
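
The OpenML datasets referenced in the Open Datasets row can be fetched programmatically. A minimal sketch, assuming scikit-learn's fetch_openml interface; the dataset name below is only an illustrative stand-in, not necessarily one of the paper's eight:

```python
from sklearn.datasets import fetch_openml

# Illustrative OpenML dataset name; the paper's exact dataset list is
# described in its appendices, and this one is a placeholder.
ds = fetch_openml(name="blood-transfusion-service-center", version=1,
                  as_frame=True)
X, y = ds.data, ds.target
print(X.shape)
print(y.value_counts())  # class balance of the target column
```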
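The stratified 4:1 training/validation split from the Dataset Splits row maps directly onto scikit-learn's train_test_split. A minimal sketch on placeholder data (the random arrays below are not the paper's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))      # placeholder features
y = rng.integers(0, 3, size=100)   # placeholder class labels

# test_size=0.2 gives the 4:1 train/validation ratio; stratify=y keeps
# per-class proportions identical across the two sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(X_train.shape, X_val.shape)
```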
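For the downstream predictors in the Software Dependencies row, data augmentation amounts to concatenating real and synthetic training data before fitting a classifier. A minimal sketch with one of the listed scikit-learn models; the arrays are placeholders standing in for a generator's output:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(40, 5)), rng.integers(0, 2, size=40)    # placeholder real data
X_synth, y_synth = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)  # placeholder synthetic data
X_val, y_val = rng.normal(size=(20, 5)), rng.integers(0, 2, size=20)        # placeholder held-out data

# Augment the (small) real training set with synthetic samples, then fit
# a downstream classifier and score it on the validation set.
X_aug = np.concatenate([X_train, X_synth])
y_aug = np.concatenate([y_train, y_synth])
clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
print("validation accuracy:", clf.score(X_val, y_val))
```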
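The Experiment Setup row describes the sampling procedure: negative samples at random hypercube corners, chains initialised near real class-c data, and SGLD with the stated hyperparameters. The sketch below is a minimal illustration, not the authors' implementation: a tiny MLP stands in for the pretrained TabPFN surrogate, and the energy is assumed to take a JEM-style form, E_c(x) = -logsumexp over the surrogate's two binary logits.

```python
import torch
import torch.nn as nn

def make_negative_samples(num_neg=4, dim=2, alpha_dist=5.0):
    # Random corners of a hypercube in R^D at alpha_dist (= 5 above)
    # standard deviations from zero, far from any standardised real data.
    signs = torch.randint(0, 2, (num_neg, dim)).float() * 2 - 1
    return alpha_dist * signs

def sgld_sample(surrogate, X_c, n_samples=64, T=200,
                alpha_step=0.1, alpha_noise=0.01, sigma_start=0.01):
    # Initialise chains at randomly chosen real samples of class c plus
    # small Gaussian noise: x_0 ~ N(X_c, sigma_start^2 I).
    idx = torch.randint(0, X_c.shape[0], (n_samples,))
    x = X_c[idx] + sigma_start * torch.randn(n_samples, X_c.shape[1])
    for _ in range(T):
        x = x.detach().requires_grad_(True)
        # Assumed JEM-style energy from the surrogate's binary logits.
        energy = -torch.logsumexp(surrogate(x), dim=-1).sum()
        grad = torch.autograd.grad(energy, x)[0]
        # SGLD update: gradient step on the energy plus Gaussian noise.
        x = x - alpha_step * grad + alpha_noise * torch.randn_like(x)
    return x.detach()

# Stand-in surrogate (the paper uses a pretrained TabPFN instead): a small
# MLP trained to separate class-c data (label 0) from negatives (label 1).
X_c = torch.randn(50, 2)  # placeholder class-c data
X_neg = make_negative_samples(dim=2)
inputs = torch.cat([X_c, X_neg])
labels = torch.cat([torch.zeros(len(X_c)), torch.ones(len(X_neg))]).long()
surrogate = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(surrogate(inputs), labels)
    loss.backward()
    opt.step()

X_synth = sgld_sample(surrogate, X_c)
print(X_synth.shape)
```

With a frozen TabPFN surrogate, the MLP would be replaced by the in-context model's class logits; the SGLD update itself (a gradient step on E_c plus injected Gaussian noise) is unchanged.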