Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields

Authors: Alan Arazi, Eilam Shapira, Roi Reichart

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, TabSTAR achieves state-of-the-art (SOTA) performance on classification datasets containing textual features, surpassing leading TFMs as well as GBDTs tuned for 4 hours.
Researcher Affiliation	Academia	Alan Arazi Eilam Shapira Roi Reichart EMAIL Technion IIT
Pseudocode	No	Figure 1: The TabSTAR architecture illustrated with our toy dataset. The model processes numerical features, textual features, and all possible target values for classification. The architecture is described in prose in Section 3 and Appendix A, rather than in structured pseudocode blocks.
Open Source Code	Yes	Code is available at https://github.com/alanarazi7/TabSTAR.
Open Datasets	Yes	We manually curate a pretraining corpus of 350 high-quality tabular datasets (253 classification, 97 regression), in a tedious process in which we uncover numerous duplications in the most popular tabular repositories, OpenML [78] and Kaggle, as elaborated by [76].
Dataset Splits	Yes	Each of the 50 datasets in the benchmark is evaluated with 10 random train-test splits (90% training, 10% testing), resulting in 500 runs per model. (...) We split each dataset into train-validation splits (95%-5%), without any need for test splits, and cap the validation set at a maximum of 1,000 examples used for evaluating pretraining performance. (...) We apply a train-test split of 90%-10% and sample a validation set of 10%.
Hardware Specification	Yes	Hardware: All baselines are evaluated using a single NVIDIA A100-SXM4 GPU with 40GB memory, and 8 CPU cores of type AMD EPYC 7742 64-Core Processor.
Software Dependencies	No	We run CatBoost using the catboost package and run the default configuration suggested by [26] by setting early_stopping_rounds = 50, od_pval = 0.001, iterations = 2000. (...) We run XGBoost using the xgboost package. (...) We run LightGBM using the lightgbm package. (...) We run it with the sklearn package. While specific software packages are mentioned, explicit version numbers are not consistently provided for all dependencies, except for the TabPFN-v2 API client v2.0.8, which is an external dependency rather than part of the core software used for TabSTAR's implementation.
Experiment Setup	Yes	We finetune downstream tasks using LoRA's implementation of the peft package. We use a rank of r = 32, set α = 2r = 64 and dropout = 0.1. We employ the same scheduler as in the pretraining phase, with the only difference being that we set lr = 0.001, and increase the patience parameter for early stopping to 5. (...) We pretrain for 50 epochs with the OneCycleLR [67] optimizer, with warmup during the first 5 epochs (10%) and cosine annealing. Early stopping is conducted after 3 epochs without improvement on the pretraining metric. The weight decay is set to 0.001, and a max learning rate of lr = 5 × 10^-5 is applied uniformly across all layers.
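
The evaluation protocol quoted under Dataset Splits (10 random 90%-10% train-test splits per dataset, plus a 10% validation sample) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function name `make_splits`, the seeding scheme, and the choice to draw the validation sample from the remaining training rows are all assumptions, since the quoted text does not specify them.

```python
import random


def make_splits(n_rows, n_repeats=10, test_frac=0.10, val_frac=0.10, base_seed=0):
    """Yield (train, val, test) index lists for repeated random splits.

    Hypothetical helper: the seeding and the validation sampling rule are
    assumptions, not details taken from the paper.
    """
    for repeat in range(n_repeats):
        rng = random.Random(base_seed + repeat)  # one seed per repeat (assumed)
        indices = list(range(n_rows))
        rng.shuffle(indices)

        # 90%-10% train-test split, as stated in the quoted text.
        n_test = round(n_rows * test_frac)
        test = indices[:n_test]
        remaining = indices[n_test:]

        # Assumption: the 10% validation sample is drawn from the training rows.
        n_val = round(len(remaining) * val_frac)
        val = remaining[:n_val]
        train = remaining[n_val:]
        yield train, val, test


splits = list(make_splits(n_rows=1000))
train, val, test = splits[0]
print(len(splits), len(train), len(val), len(test))
```

With 50 benchmark datasets and 10 splits each, this scheme yields the 500 runs per model mentioned in the table.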
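
The pretraining schedule quoted under Experiment Setup (50 epochs, warmup over the first 5 epochs, cosine annealing, early stopping after 3 epochs without improvement) can be illustrated with a small standard-library sketch. It does not reproduce the exact OneCycleLR behaviour or the authors' training loop: the starting learning rate, the final learning rate, the linear warmup shape, and the "higher metric is better" convention are assumptions made for illustration.

```python
import math

MAX_LR = 5e-5            # max learning rate stated in the quoted text
EPOCHS = 50              # pretraining epochs stated in the quoted text
WARMUP_EPOCHS = 5        # first 10% of epochs, as stated
START_LR = MAX_LR / 25   # assumption: not given in the quoted text
FINAL_LR = 0.0           # assumption: not given in the quoted text


def lr_at(epoch):
    """Learning rate at a given epoch: warmup, then cosine annealing."""
    if epoch < WARMUP_EPOCHS:
        # Assumption: linear warmup; the quoted text does not state the shape.
        return START_LR + (MAX_LR - START_LR) * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return FINAL_LR + 0.5 * (MAX_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))


def stopped_epoch(val_metrics, patience):
    """Return the epoch at which early stopping triggers, or None.

    Assumption: a higher metric is better.
    """
    best = float("-inf")
    stale = 0
    for epoch, metric in enumerate(val_metrics):
        if metric > best:
            best, stale = metric, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


history = [0.50, 0.60, 0.59, 0.58, 0.57]
print(lr_at(0), lr_at(WARMUP_EPOCHS), lr_at(EPOCHS))
print(stopped_epoch(history, patience=3), stopped_epoch(history, patience=5))
```

The same `stopped_epoch` helper covers both phases described in the table: a patience of 3 for pretraining and 5 for finetuning.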