SwitchTab: Switched Autoencoders Are Effective Tabular Learners

Authors: Jing Wu, Suiyao Chen, Qi Zhao, Renat Sergazinov, Chen Li, Shengjie Liu, Chongchao Zhao, Tianpei Xie, Hanqing Guo, Cheng Ji, Daniel Cociorva, Hakan Brunzell

AAAI 2024

Reproducibility Variable Result LLM Response
Research Type: Experimental. To validate the effectiveness of SwitchTab, we conduct extensive experiments across various domains involving tabular data. The results showcase superior performance in end-to-end prediction tasks with fine-tuning. Moreover, we demonstrate that pre-trained salient embeddings can be utilized as plug-and-play features to enhance the performance of various traditional classification methods (e.g., Logistic Regression and XGBoost). Lastly, we highlight the capability of SwitchTab to create explainable representations through visualization of decoupled mutual and salient features in the latent space.
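The "plug-and-play" use of pre-trained salient embeddings described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: encode_salient is a hypothetical stand-in for SwitchTab's frozen salient encoder, and the downstream classifier is a bare-bones logistic regression in NumPy.

```python
import numpy as np

def encode_salient(x, dim=16, seed=42):
    # Hypothetical stand-in for SwitchTab's pre-trained salient encoder;
    # a real pipeline would apply the frozen encoder's salient projection.
    w = np.random.default_rng(seed).normal(size=(x.shape[1], dim))
    return np.tanh(x @ w)

def fit_logreg(x, y, lr=0.1, steps=500):
    # Minimal gradient-descent logistic regression, standing in for a
    # "traditional classification method" that consumes the embeddings.
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(0)
x, y = rng.normal(size=(200, 8)), rng.integers(0, 2, 200)

# "Plug-and-play": concatenate raw features with salient embeddings,
# then train any off-the-shelf classifier on the augmented features.
features = np.hstack([x, encode_salient(x)])
w = fit_logreg(features, y.astype(float))
preds = (features @ w > 0).astype(int)
```

The same augmented feature matrix could be fed to XGBoost or scikit-learn's LogisticRegression in place of the toy classifier here.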
Researcher Affiliation: Industry. Jing Wu*, Suiyao Chen*, Qi Zhao, Renat Sergazinov, Chen Li, Shengjie Liu, Chongchao Zhao, Tianpei Xie, Hanqing Guo, Cheng Ji, Daniel Cociorva, Hakan Brunzell; Amazon Buyer Risk Prevention, 4575 La Jolla Village Dr, San Diego, California 92122, USA; {jingwua, suiyaoc, qqzhao, renserg, chenlii, zycjlsj, zchongch, lukexie, hanqiguo, cjiamzn, cociorva, brunzell}@amazon.com
Pseudocode: Yes. Algorithm 1: Self-supervised Learning with SwitchTab.
Open Source Code: No. The paper does not include any explicit statement about releasing source code, nor a link to a code repository.
Open Datasets: Yes. We first evaluate the performance of SwitchTab on a standard benchmark from (Gorishniy et al. 2021). Concretely, the datasets include: California Housing (CA) (Pace and Barry 1997), Adult (AD) (Kohavi et al. 1996), Helena (HE) (Guyon et al. 2019b), Jannis (JA) (Guyon et al. 2019b), Higgs (HI) (Baldi, Sadowski, and Whiteson 2014), ALOI (AL) (Geusebroek, Burghouts, and Smeulders 2005), Epsilon (EP) (Yuan, Ho, and Lin 2011), Year (YE) (Bertin-Mahieux et al. 2011), Covertype (CO) (Blackard and Dean 1999), Yahoo (YA) (Chapelle and Chang 2011), Microsoft (MI) (Qin and Liu 2013). Besides the standard benchmarks, there is also another set of popular datasets used by recent work (Somepalli et al. 2021), including Bank (BK) (Moro, Cortez, and Rita 2014), Blastchar (BC) (Ouk, Dada, and Kang 2018), Arrhythmia (AT) (Liu, Ting, and Zhou 2008; Ouk, Dada, and Kang 2018), Arcene (AR) (Asuncion and Newman 2007), Shoppers (SH) (Sakar et al. 2019), Volkert (VO) (Guyon et al. 2019a) and MNIST (MN) (Xiao, Rasul, and Vollgraf 2017).
Dataset Splits: No. The paper mentions using standard benchmarks and fine-tuning according to established paradigms, but it does not explicitly provide training/validation/test split percentages or sample counts for its experiments.
Hardware Specification: No. The paper does not provide details about the hardware used for the experiments (e.g., GPU/CPU models, memory, or cloud instance types).
Software Dependencies: No. The paper mentions the RMSprop and Adam optimizers, but it does not specify any software versions (e.g., programming language versions, machine learning libraries, or other dependencies with version numbers).
Experiment Setup: Yes. For feature corruption, we uniformly sample a subset of features for each sample to generate a corrupted view at a fixed corruption ratio of 0.3. For the encoder f, we employ a three-layer transformer with two heads. Both projectors p_s and p_m consist of one linear layer followed by a sigmoid activation function. Additionally, the decoder d remains a one-layer network with a sigmoid activation function. For all pre-training, we train all models for 1000 epochs with a default batch size of 128. We use the RMSprop optimizer (Hinton, Srivastava, and Swersky 2012) with an initial learning rate of 0.0003. During the fine-tuning stage, we set the maximum number of epochs to 200 and use the Adam optimizer with a learning rate of 0.001.
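The feature-corruption step quoted above (uniformly sampling a subset of each sample's features at a fixed ratio of 0.3) can be sketched in NumPy. This is a hedged sketch under one common assumption for tabular self-supervised learning: the sampled features are replaced with values drawn from other samples in the batch. The paper's exact replacement scheme may differ.

```python
import numpy as np

def corrupt_features(x, corruption_ratio=0.3, seed=None):
    """Return a corrupted view of batch x (shape [n, d]).

    For each sample, uniformly sample a fixed fraction of feature
    columns and overwrite them with the corresponding values from
    randomly chosen donor rows in the same batch (an assumption;
    only the 0.3 ratio and uniform sampling come from the paper).
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    k = int(d * corruption_ratio)          # number of features to corrupt
    x_corrupted = x.copy()
    for i in range(n):
        cols = rng.choice(d, size=k, replace=False)   # features to corrupt
        donors = rng.integers(0, n, size=k)           # random donor rows
        x_corrupted[i, cols] = x[donors, cols]
    return x_corrupted

rng = np.random.default_rng(0)
batch = rng.normal(size=(128, 10))          # default batch size of 128
corrupted = corrupt_features(batch, corruption_ratio=0.3, seed=1)
```

Per the setup above, the clean and corrupted views would then be fed through the shared transformer encoder f and the projectors p_s and p_m during the 1000-epoch pre-training run.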