XTab: Cross-table Pretraining for Tabular Transformers

Authors: Bingzhao Zhu, Xingjian Shi, Nick Erickson, Mu Li, George Karypis, Mahsa Shoaran

ICML 2023

Reproducibility assessment. Each entry gives the variable, the result, and the supporting LLM response quoted from the paper.
Research Type: Experimental. LLM response: "Tested on 84 tabular prediction tasks from the OpenML-AutoML Benchmark (AMLB), we show that (1) XTab consistently boosts the generalizability, learning speed, and performance of multiple tabular transformers, (2) by pretraining FT-Transformer via XTab, we achieve superior performance than other state-of-the-art tabular deep learning models on various tasks such as regression, binary, and multiclass classification. To verify our design, we conducted extensive experiments on the AutoML Benchmark (AMLB)."
Researcher Affiliation: Collaboration. LLM response: "1 EPFL, Lausanne, Switzerland; 2 Cornell University, Ithaca, USA; 3 Boson AI, USA; 4 Amazon Web Services, USA. Correspondence to: Bingzhao Zhu <bz323@cornell.edu>."
Pseudocode: No. LLM response: "We use the Federated Averaging (FedAvg) algorithm to pretrain XTab (McMahan et al., 2017; Li et al., 2019)." (A minimal FedAvg pretraining sketch is given below this list.)
Open Source Code: Yes. LLM response: "The code and sample pretrained checkpoints are attached to https://github.com/BingzhaoZhu/XTab."
Open Datasets: Yes. LLM response: "We use the public OpenML-AutoML Benchmark (AMLB: openml.github.io/automlbenchmark/) (Gijsbers et al., 2022) for pretraining and evaluation."
Dataset Splits: Yes. LLM response: "Over the remaining data, we randomly partition 87.5% (7/8) into the training set and use 12.5% (1/8) for validation." (See the data-loading and split sketch below this list.)
Hardware Specification: Yes. LLM response: "Both pretraining and finetuning were performed on a cloud cluster of NVIDIA T4 GPUs (16 GB memory)."
Software Dependencies: No. LLM response: "We used the AutoGluon neural networks implemented on top of PyTorch (Erickson et al., 2020)" and "Both stages use AdamW as the optimizer." No specific version numbers are provided for these software components.
Experiment Setup: Yes. LLM response: "Our default model configuration of transformer variants is the same as Gorishniy et al. (2021), with 3 transformer blocks, a feature embedding size of 192 and 8 attention heads. The feed forward networks (Figure 1) have two layers with the same size as the embedding. We apply a dropout ratio of 20% to attention layers and 10% for feed forward networks. We use ReGLU (Shazeer, 2020) as the activation function and layer normalization (Ba et al., 2016) in the feed forward layers. The projection heads are ReLU networks with 2 layers and a hidden dimension of 192. All model components use Kaiming initialization (He et al., 2015) with the bias terms fixed at zeros. The batch size is fixed at 128 for both pretraining and finetuning. Both stages use AdamW as the optimizer, with a learning rate of 1e-4. Following Gorishniy et al. (2021); Rubachev et al. (2022), we also apply a weight decay of 1e-5 to all components excluding featurizers, [CLS] tokens, layer normalization and bias terms." (See the configuration and optimizer sketch below this list.)
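
As a companion to the Pseudocode entry above, here is a minimal, hedged sketch of FedAvg-style cross-table pretraining: each dataset keeps its own featurizer and projection head, while a shared transformer backbone is updated locally and then averaged across datasets. The `Client` container, module names, and the uniform averaging rule are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): FedAvg-style pretraining of a
# shared transformer backbone across several tabular datasets. Each "client" owns
# its dataset-specific featurizer and projection head; only the backbone is averaged.
import copy
from collections import namedtuple

import torch

# Hypothetical container: featurizer/head are per-dataset modules, loader yields
# (features, target) batches, loss_fn matches the task (MSE, cross-entropy, ...).
Client = namedtuple("Client", ["featurizer", "head", "loader", "loss_fn"])


def fedavg_round(backbone, clients, local_steps=1, lr=1e-4):
    """One communication round: local updates on each dataset, then weight averaging."""
    local_states = []
    for client in clients:
        local = copy.deepcopy(backbone)
        params = (list(local.parameters())
                  + list(client.featurizer.parameters())
                  + list(client.head.parameters()))
        opt = torch.optim.AdamW(params, lr=lr)
        for step, (x, y) in enumerate(client.loader):
            if step >= local_steps:
                break
            pred = client.head(local(client.featurizer(x)))
            loss = client.loss_fn(pred, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        local_states.append(local.state_dict())
    # Uniform FedAvg over the floating-point backbone weights.
    avg_state = {}
    for key, value in local_states[0].items():
        if value.dtype.is_floating_point:
            avg_state[key] = torch.stack([s[key] for s in local_states]).mean(dim=0)
        else:
            avg_state[key] = value  # e.g. integer buffers: keep the first copy
    backbone.load_state_dict(avg_state)
    return backbone
```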
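For the Open Datasets and Dataset Splits entries, the sketch below uses the `openml` Python package (assuming a recent version whose `get_data` returns pandas objects) to fetch one benchmark dataset and reproduces the quoted 7/8 train, 1/8 validation partition. The dataset id (31, "credit-g") and the fixed seed are example choices, not taken from the paper.

```python
# Illustrative sketch: fetch an OpenML dataset and split it 87.5% / 12.5%.
# Dataset id 31 ("credit-g") and the seed are example choices, not from the paper.
import numpy as np
import openml

dataset = openml.datasets.get_dataset(31)
X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)

rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
n_train = int(len(X) * 7 / 8)          # 7/8 training, 1/8 validation
train_idx, val_idx = perm[:n_train], perm[n_train:]

X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
```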
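Finally, for the Experiment Setup entry, the quoted hyperparameters can be collected into a single configuration, and the weight-decay exclusions can be expressed as AdamW parameter groups. The config keys and the name-matching rules below are assumptions for illustration; the actual module and parameter names in the authors' code may differ.

```python
# Illustrative sketch: the quoted hyperparameters as a config dict, plus AdamW
# parameter groups that exclude featurizers, the [CLS] token, layer normalization,
# and bias terms from weight decay. Name-matching rules are assumptions.
import torch

CONFIG = {
    "n_blocks": 3,             # transformer blocks
    "d_token": 192,            # feature embedding size
    "n_heads": 8,
    "attention_dropout": 0.2,
    "ffn_dropout": 0.1,
    "activation": "reglu",
    "batch_size": 128,
    "lr": 1e-4,
    "weight_decay": 1e-5,
}


def make_optimizer(model, cfg=CONFIG):
    """Build AdamW with weight decay applied only to the remaining parameters."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Skip weight decay for featurizers, the [CLS] token, LayerNorm, and biases.
        if ("featurizer" in name or "cls_token" in name
                or "norm" in name or name.endswith("bias")):
            no_decay.append(param)
        else:
            decay.append(param)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": cfg["weight_decay"]},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=cfg["lr"],
    )
```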