XTab: Cross-table Pretraining for Tabular Transformers

Authors: Bingzhao Zhu, Xingjian Shi, Nick Erickson, Mu Li, George Karypis, Mahsa Shoaran

ICML 2023

Reproducibility assessment. Each entry gives the variable, the result, and the supporting LLM response quoted from the paper.
Research Type: Experimental. LLM response: "Tested on 84 tabular prediction tasks from the OpenML-AutoML Benchmark (AMLB), we show that (1) XTab consistently boosts the generalizability, learning speed, and performance of multiple tabular transformers, (2) by pretraining FT-Transformer via XTab, we achieve superior performance than other state-of-the-art tabular deep learning models on various tasks such as regression, binary, and multiclass classification. To verify our design, we conducted extensive experiments on the AutoML Benchmark (AMLB)."
Researcher Affiliation: Collaboration. LLM response: "1 EPFL, Lausanne, Switzerland; 2 Cornell University, Ithaca, USA; 3 Boson AI, USA; 4 Amazon Web Services, USA. Correspondence to: Bingzhao Zhu <bz323@cornell.edu>."
Pseudocode: No. LLM response: "We use the Federated Averaging (FedAvg) algorithm to pretrain XTab (McMahan et al., 2017; Li et al., 2019)." (A minimal FedAvg pretraining sketch is given below this list.)
Open Source Code: Yes. LLM response: "The code and sample pretrained checkpoints are attached to https://github.com/BingzhaoZhu/XTab."
Open Datasets: Yes. LLM response: "We use the public OpenML-AutoML Benchmark (AMLB: openml.github.io/automlbenchmark/) (Gijsbers et al., 2022) for pretraining and evaluation."
Dataset Splits: Yes. LLM response: "Over the remaining data, we randomly partition 87.5% (7/8) into the training set and use 12.5% (1/8) for validation." (See the data-loading and split sketch below this list.)
Hardware Specification: Yes. LLM response: "Both pretraining and finetuning were performed on a cloud cluster of NVIDIA T4 GPUs (16 GB memory)."
Software Dependencies: No. LLM response: "We used the AutoGluon neural networks implemented on top of PyTorch (Erickson et al., 2020)" and "Both stages use AdamW as the optimizer." No specific version numbers are provided for these software components.
Experiment Setup: Yes. LLM response: "Our default model configuration of transformer variants is the same as Gorishniy et al. (2021), with 3 transformer blocks, a feature embedding size of 192 and 8 attention heads. The feed forward networks (Figure 1) have two layers with the same size as the embedding. We apply a dropout ratio of 20% to attention layers and 10% for feed forward networks. We use ReGLU (Shazeer, 2020) as the activation function and layer normalization (Ba et al., 2016) in the feed forward layers. The projection heads are ReLU networks with 2 layers and a hidden dimension of 192. All model components use Kaiming initialization (He et al., 2015) with the bias terms fixed at zeros. The batch size is fixed at 128 for both pretraining and finetuning. Both stages use AdamW as the optimizer, with a learning rate of 1e-4. Following Gorishniy et al. (2021); Rubachev et al. (2022), we also apply a weight decay of 1e-5 to all components excluding featurizers, [CLS] tokens, layer normalization and bias terms." (See the configuration and optimizer sketch below this list.)
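
As a companion to the Pseudocode entry above, here is a minimal, hedged sketch of FedAvg-style cross-table pretraining: each dataset keeps its own featurizer and projection head, while a shared transformer backbone is updated locally and then averaged across datasets. The `Client` container, module names, and the uniform averaging rule are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): FedAvg-style pretraining of a
# shared transformer backbone across several tabular datasets. Each "client" owns
# its dataset-specific featurizer and projection head; only the backbone is averaged.
import copy
from collections import namedtuple

import torch

# Hypothetical container: featurizer/head are per-dataset modules, loader yields
# (features, target) batches, loss_fn matches the task (MSE, cross-entropy, ...).
Client = namedtuple("Client", ["featurizer", "head", "loader", "loss_fn"])


def fedavg_round(backbone, clients, local_steps=1, lr=1e-4):
    """One communication round: local updates on each dataset, then weight averaging."""
    local_states = []
    for client in clients:
        local = copy.deepcopy(backbone)
        params = (list(local.parameters())
                  + list(client.featurizer.parameters())
                  + list(client.head.parameters()))
        opt = torch.optim.AdamW(params, lr=lr)
        for step, (x, y) in enumerate(client.loader):
            if step >= local_steps:
                break
            pred = client.head(local(client.featurizer(x)))
            loss = client.loss_fn(pred, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        local_states.append(local.state_dict())
    # Uniform FedAvg over the floating-point backbone weights.
    avg_state = {}
    for key, value in local_states[0].items():
        if value.dtype.is_floating_point:
            avg_state[key] = torch.stack([s[key] for s in local_states]).mean(dim=0)
        else:
            avg_state[key] = value  # e.g. integer buffers: keep the first copy
    backbone.load_state_dict(avg_state)
    return backbone
```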
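For the Open Datasets and Dataset Splits entries, the sketch below uses the `openml` Python package (assuming a recent version whose `get_data` returns pandas objects) to fetch one benchmark dataset and reproduces the quoted 7/8 train, 1/8 validation partition. The dataset id (31, "credit-g") and the fixed seed are example choices, not taken from the paper.

```python
# Illustrative sketch: fetch an OpenML dataset and split it 87.5% / 12.5%.
# Dataset id 31 ("credit-g") and the seed are example choices, not from the paper.
import numpy as np
import openml

dataset = openml.datasets.get_dataset(31)
X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)

rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
n_train = int(len(X) * 7 / 8)          # 7/8 training, 1/8 validation
train_idx, val_idx = perm[:n_train], perm[n_train:]

X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
```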
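Finally, for the Experiment Setup entry, the quoted hyperparameters can be collected into a single configuration, and the weight-decay exclusions can be expressed as AdamW parameter groups. The config keys and the name-matching rules below are assumptions for illustration; the actual module and parameter names in the authors' code may differ.

```python
# Illustrative sketch: the quoted hyperparameters as a config dict, plus AdamW
# parameter groups that exclude featurizers, the [CLS] token, layer normalization,
# and bias terms from weight decay. Name-matching rules are assumptions.
import torch

CONFIG = {
    "n_blocks": 3,             # transformer blocks
    "d_token": 192,            # feature embedding size
    "n_heads": 8,
    "attention_dropout": 0.2,
    "ffn_dropout": 0.1,
    "activation": "reglu",
    "batch_size": 128,
    "lr": 1e-4,
    "weight_decay": 1e-5,
}


def make_optimizer(model, cfg=CONFIG):
    """Build AdamW with weight decay applied only to the remaining parameters."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Skip weight decay for featurizers, the [CLS] token, LayerNorm, and biases.
        if ("featurizer" in name or "cls_token" in name
                or "norm" in name or name.endswith("bias")):
            no_decay.append(param)
        else:
            decay.append(param)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": cfg["weight_decay"]},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=cfg["lr"],
    )
```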