Making Pre-trained Language Models Great on Tabular Prediction

Authors: Jiahuan Yan, Bo Zheng, Hongxia Xu, Yiheng Zhu, Danny Chen, Jimeng Sun, Jian Wu, Jintai Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments demonstrate that our pre-trained TP-BERTa leads the performance among tabular DNNs and is competitive with Gradient Boosted Decision Tree models in the typical tabular data regime.
Researcher Affiliation | Academia | Jiahuan Yan (1,2), Bo Zheng (2), Hongxia Xu (2), Yiheng Zhu (2), Danny Z. Chen (3), Jimeng Sun (4), Jian Wu (1,2), Jintai Chen (4); affiliations: (1) The Second Affiliated Hospital, Zhejiang University School of Medicine, (2) Zhejiang University, (3) University of Notre Dame, (4) University of Illinois at Urbana-Champaign.
Pseudocode | No | The paper provides mathematical formulations and a workflow illustration (Fig. 1) but does not include explicit pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | Codes are available at https://github.com/jyansir/tp-berta.
Open Datasets | Yes | We leverage the high-quality large semantic tabular database TabPertNet (Ye et al., 2024). In total, our pre-training datasets consist of 101 binary classification datasets and 101 regression datasets with about 10 million samples, and our downstream datasets consist of 80 binary classification datasets and 65 regression datasets. Detailed dataset statistics are provided in Appendix B.
Dataset Splits | Yes | We split each finetune dataset (64%, 16%, 20% for training, validation, and testing, respectively), and keep the same label distribution in each split on binary classification. Because the massive LM is likely to overfit a single dataset, we use 5% of the training data as the validation set. (A split sketch follows the table.)
Hardware Specification | Yes | Pre-training is conducted on four NVIDIA A100 Tensor Core GPUs, with a total batch size of 512 per step. All the models are finetuned on an NVIDIA RTX 3090.
Software Dependencies | Yes | We implement our TP-BERTa with PyTorch and the Hugging Face Transformers package on Python 3.8. Pre-training is conducted with PyTorch version 1.9.0, CUDA version 11.3, and Hugging Face Transformers package version 4.18.0. (An environment-check sketch follows the table.)
Experiment Setup | Yes | We use a total of 30 training epochs, with a linear warm-up for the first 6% of steps, followed by a linear decay to 0. The best checkpoint is saved by the average validation loss over all the datasets. We keep a constant weight λ = 0.1 in pre-training. In training, we uniformly use a training batch size of 64 for all the DNNs. For the other DNNs, the optimizer is AdamW (Loshchilov & Hutter, 2019) with the default configuration except for the learning rate and weight decay rate. For TP-BERTa... we keep the default hyperparameters of 1e-5 learning rate without weight decay. (An optimizer and scheduler sketch follows the table.)
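
The split protocol in the Dataset Splits row (64%/16%/20%, with label stratification for binary classification) can be reproduced with standard tooling. Below is a minimal sketch using scikit-learn; the feature/label arrays X and y and the function name split_dataset are placeholders, not the authors' actual data-loading code.

from sklearn.model_selection import train_test_split

def split_dataset(X, y, stratify_labels=True, seed=0):
    """Sketch of a 64%/16%/20% train/validation/test split, optionally stratified by label."""
    strat = y if stratify_labels else None
    # First carve out the 20% test split.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=strat)
    strat = y_trainval if stratify_labels else None
    # 16% of the full data equals 0.16 / 0.80 = 20% of the remaining train+val pool.
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.20, random_state=seed, stratify=strat)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)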
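
The version pins in the Software Dependencies row can be verified with a short sanity check. This is only an illustrative sketch, not part of the TP-BERTa repository; the expected values are the ones quoted in the table.

import sys
import torch
import transformers

# Expected per the paper: Python 3.8, PyTorch 1.9.0 (CUDA 11.3 build), Transformers 4.18.0.
print(sys.version)             # expect 3.8.x
print(torch.__version__)       # expect 1.9.0 (+cu113 build)
print(torch.version.cuda)      # expect 11.3
print(transformers.__version__)  # expect 4.18.0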
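
The Experiment Setup row reports AdamW with a 1e-5 learning rate and no weight decay, batch size 64, 30 epochs, and a linear warm-up over the first 6% of steps followed by a linear decay to 0. A minimal sketch of that recipe against the Hugging Face scheduler API is given below; model, train_loader, and build_optimizer_and_scheduler are placeholders, and this is not the authors' training loop.

import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_scheduler(model, train_loader, epochs=30, lr=1e-5):
    # AdamW with no weight decay, as reported for TP-BERTa finetuning.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)
    total_steps = epochs * len(train_loader)   # train_loader built with batch_size=64
    warmup_steps = int(0.06 * total_steps)     # linear warm-up over the first 6% of steps
    # Linear warm-up, then linear decay of the learning rate to 0.
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    return optimizer, scheduler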