SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning

Authors: Talip Ucar, Ehsan Hajiramezanali, Lindsay Edwards

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that SubTab achieves state-of-the-art (SOTA) performance of 98.31% on MNIST in the tabular setting, on par with CNN-based SOTA models, and surpasses existing baselines on three other real-world datasets by a significant margin.
Researcher Affiliation | Industry | Talip Uçar, Ehsan Hajiramezanali, Lindsay Edwards; Respiratory and Immunology, R&D, AstraZeneca; {talip.ucar, ehsan.hajiramezanali, lindsay.edwards}@astrazeneca.com
Pseudocode | Yes | The pseudocode of the algorithm can be found in Algorithm 1 in the Appendix.
Open Source Code | Yes | The code for SubTab is provided at https://github.com/AstraZeneca/SubTab.
Open Datasets | Yes | We use five different datasets: MNIST in tabular format, the cancer genome atlas (TCGA) [42], human gut metagenomic samples of obesity cohorts (Obesity) [36, 26], UCI Adult Income (Income) [24], and UCI Blog Feedback (Blog) [4].
Dataset Splits | Yes | MNIST: We split the training set into training and validation sets (90-10% split) when searching for hyper-parameters, and then used all of the training set to train the final model. The test set is used only for final evaluation. TCGA: It includes 6671 samples with 122 features, which we divided into 80-10-10% train-validation-test sets. Obesity: ...using 10 randomly drawn training-test (90-10%) splits, for each of which we used 10-fold cross-validation.
Hardware Specification | No | The paper does not mention any specific hardware used for its experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using Python, PyTorch, and scikit-learn in its references, and ReLU/leaky ReLU as activation functions, but it does not specify version numbers for any of these software dependencies, which is required for reproducibility.
Experiment Setup | Yes | The summary of model architectures and hyperparameters is in Table A1 in the Appendix. We used a simple three-layer encoder architecture with dimensions of [512, 256, 128]. For SubTab, we trained our base model multiple times without noise at the input. For each training, we used a different number of subsets with different levels of overlap between neighbouring subsets (Figure 3a). Gaussian noise N(0, 0.3) and masking ratio p = 0.2 work well across all models.
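
To make the Experiment Setup row above more concrete, here is a minimal PyTorch sketch of the described components: a three-layer [512, 256, 128] encoder with leaky ReLU, overlapping feature subsets, and corruption with Gaussian noise N(0, 0.3) applied to a randomly masked 20% of entries. The helper names (`make_subsets`, `corrupt`) and the exact subsetting and corruption order are illustrative assumptions, not the authors' released SubTab implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Three-layer MLP encoder with hidden sizes [512, 256, 128]."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.LeakyReLU(),
            nn.Linear(512, 256), nn.LeakyReLU(),
            nn.Linear(256, 128),
        )

    def forward(self, x):
        return self.net(x)

def make_subsets(x, n_subsets=4, overlap=0.75):
    """Split the feature columns into n_subsets equally sized blocks, each
    sharing a fraction `overlap` of its width with its neighbour
    (illustrative subsetting scheme)."""
    d = x.shape[1]
    sub_dim = int(d / (n_subsets - (n_subsets - 1) * overlap))
    stride = int(sub_dim * (1 - overlap))
    starts = [min(i * stride, d - sub_dim) for i in range(n_subsets)]
    return [x[:, s:s + sub_dim] for s in starts]

def corrupt(x, noise_std=0.3, mask_ratio=0.2):
    """Add N(0, 0.3) Gaussian noise to a randomly masked 20% of the entries."""
    mask = (torch.rand_like(x) < mask_ratio).float()
    return x + mask * torch.randn_like(x) * noise_std

# Usage sketch: encode each corrupted subset with a shared encoder.
x = torch.randn(32, 784)                      # e.g. a flattened MNIST batch
subsets = make_subsets(x)
encoder = Encoder(in_dim=subsets[0].shape[1])
latents = [encoder(corrupt(s)) for s in subsets]
```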
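
Similarly, the Dataset Splits row could be reproduced along the following lines with scikit-learn (which the report lists among the paper's dependencies). The placeholder data, random seeds, and variable names are assumptions for illustration, not the authors' exact protocol.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Illustrative data; real experiments use the MNIST/TCGA/Obesity feature matrices.
X, y = np.random.rand(1000, 122), np.random.randint(0, 2, 1000)

# MNIST-style protocol: 90-10 train/validation split for hyper-parameter search.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

# TCGA-style protocol: 80-10-10 train/validation/test split.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Obesity-style protocol: 10 random 90-10 train/test draws, each with 10-fold CV.
for draw in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=draw)
    kf = KFold(n_splits=10, shuffle=True, random_state=draw)
    for fold, (tr_idx, va_idx) in enumerate(kf.split(X_tr)):
        pass  # train on X_tr[tr_idx], validate on X_tr[va_idx]
```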