SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning

Authors: Talip Ucar, Ehsan Hajiramezanali, Lindsay Edwards

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that SubTab achieves state-of-the-art (SOTA) performance of 98.31% on MNIST in the tabular setting, on par with CNN-based SOTA models, and surpasses existing baselines on three other real-world datasets by a significant margin.
Researcher Affiliation | Industry | Talip Uçar, Ehsan Hajiramezanali, Lindsay Edwards; Respiratory and Immunology, R&D, AstraZeneca; {talip.ucar, ehsan.hajiramezanali, lindsay.edwards}@astrazeneca.com
Pseudocode | Yes | The pseudocode of the algorithm can be found in Algorithm 1 in the Appendix.
Open Source Code | Yes | The code for SubTab is provided at https://github.com/AstraZeneca/SubTab.
Open Datasets | Yes | We use five different datasets: MNIST in tabular format, the cancer genome atlas (TCGA) [42], human gut metagenomic samples of obesity cohorts (Obesity) [36, 26], UCI Adult Income (Income) [24], and UCI Blog Feedback (Blog) [4].
Dataset Splits | Yes | MNIST: We split the training set into training and validation sets (90-10% split) when searching for hyper-parameters, and then used all of the training set to train the final model. The test set is used only for final evaluation. TCGA: It includes 6671 samples with 122 features, which we divided into 80-10-10% train-validation-test sets. Obesity: ...using 10 randomly drawn training-test (90-10%) splits, for each of which we used 10-fold cross-validation.
Hardware Specification | No | The paper does not mention any specific hardware used for its experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using Python, PyTorch, and scikit-learn in its references, and ReLU/leaky ReLU as activation functions, but it does not specify version numbers for any of these software dependencies, which is required for reproducibility.
Experiment Setup | Yes | The summary of model architectures and hyperparameters is in Table A1 in the Appendix. We used a simple three-layer encoder architecture with dimensions of [512, 256, 128]. For SubTab, we trained our base model multiple times without noise at the input. For each training, we used a different number of subsets with different levels of overlap between neighbouring subsets (Figure 3a). Gaussian noise N(0, 0.3) and masking ratio p = 0.2 work well across all models.
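
To make the Experiment Setup row above more concrete, here is a minimal PyTorch sketch of the described components: a three-layer [512, 256, 128] encoder with leaky ReLU, overlapping feature subsets, and corruption with Gaussian noise N(0, 0.3) applied to a randomly masked 20% of entries. The helper names (`make_subsets`, `corrupt`) and the exact subsetting and corruption order are illustrative assumptions, not the authors' released SubTab implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Three-layer MLP encoder with hidden sizes [512, 256, 128]."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.LeakyReLU(),
            nn.Linear(512, 256), nn.LeakyReLU(),
            nn.Linear(256, 128),
        )

    def forward(self, x):
        return self.net(x)

def make_subsets(x, n_subsets=4, overlap=0.75):
    """Split the feature columns into n_subsets equally sized blocks, each
    sharing a fraction `overlap` of its width with its neighbour
    (illustrative subsetting scheme)."""
    d = x.shape[1]
    sub_dim = int(d / (n_subsets - (n_subsets - 1) * overlap))
    stride = int(sub_dim * (1 - overlap))
    starts = [min(i * stride, d - sub_dim) for i in range(n_subsets)]
    return [x[:, s:s + sub_dim] for s in starts]

def corrupt(x, noise_std=0.3, mask_ratio=0.2):
    """Add N(0, 0.3) Gaussian noise to a randomly masked 20% of the entries."""
    mask = (torch.rand_like(x) < mask_ratio).float()
    return x + mask * torch.randn_like(x) * noise_std

# Usage sketch: encode each corrupted subset with a shared encoder.
x = torch.randn(32, 784)                      # e.g. a flattened MNIST batch
subsets = make_subsets(x)
encoder = Encoder(in_dim=subsets[0].shape[1])
latents = [encoder(corrupt(s)) for s in subsets]
```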
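
Similarly, the Dataset Splits row could be reproduced along the following lines with scikit-learn (which the report lists among the paper's dependencies). The placeholder data, random seeds, and variable names are assumptions for illustration, not the authors' exact protocol.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Illustrative data; real experiments use the MNIST/TCGA/Obesity feature matrices.
X, y = np.random.rand(1000, 122), np.random.randint(0, 2, 1000)

# MNIST-style protocol: 90-10 train/validation split for hyper-parameter search.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

# TCGA-style protocol: 80-10-10 train/validation/test split.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Obesity-style protocol: 10 random 90-10 train/test draws, each with 10-fold CV.
for draw in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=draw)
    kf = KFold(n_splits=10, shuffle=True, random_state=draw)
    for fold, (tr_idx, va_idx) in enumerate(kf.split(X_tr)):
        pass  # train on X_tr[tr_idx], validate on X_tr[va_idx]
```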