Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models
Authors: Xiyuan Zhang, Danielle Maddix Robinson, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W. Mahoney, Tony Hu, Huzefa Rangwala, George Karypis, Yuyang (Bernie) Wang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate these questions, with the goal of identifying key properties of synthetic priors used for pretraining TFMs. Our findings sharpen a vague rule of thumb that diversity of the prior is important. We show that the effectiveness of a synthetic prior depends on: (1) the performance of a TFM pretrained solely on data generated from that prior, when evaluated on real tabular data; (2) its diversity, i.e., how difficult it is for a TFM pretrained on this prior to overfit on its own distribution; and (3) distinctiveness within a mixture of priors, i.e., how hard it is for data generated from this prior to be predicted by TFMs pretrained on other priors. See Figure 1 for a simplified illustration. ... 4 Empirical Results. In this section, we show that MITRA achieves SOTA performance on both classification and regression tasks (Section 4.2). We demonstrate that MITRA is model agnostic and consistently improves the performance with both 1D attention (MITRA 1D) and 2D attention architectures (Section 4.3). Furthermore, we highlight MITRA's better sample efficiency (Section 4.4), strong performance when combined with advanced ensembling techniques (Section 4.5), and strong fine-tuning performance (Section 4.6). We also conduct an ablation study to quantify the importance of each prior (Section 4.7), which supports our findings in Section 3.2. Finally, we analyze the scaling law with respect to both model size and synthetic dataset size (Section 4.8). See Appendix B for additional experimental setting details and Appendix C for additional experimental results. |
| Researcher Affiliation | Collaboration | Xiyuan Zhang Amazon Danielle C. Maddix Amazon Junming Yin Amazon Nick Erickson Amazon Abdul Fatir Ansari Amazon Boran Han Amazon Shuai Zhang Amazon Leman Akoglu Amazon and CMU Christos Faloutsos Amazon and CMU Michael W. Mahoney Amazon Cuixiong Hu Amazon Huzefa Rangwala Amazon George Karypis Amazon Bernie Wang Amazon |
| Pseudocode | Yes | Algorithm 1 MITRA: Mixture of Synthetic Priors for pretraining TFMs Algorithm 2 Indirectly Sampled TBPs. Algorithm 3 Target Generation From DT Traversal Algorithm 4 Construct Random Decision Tree (DT) Algorithm 5 Directly Sampled Random Forest (DSRF) |
| Open Source Code | No | Justification: Our code release needs to go through a legal review process from our institute, and we will release the code after the legal review is complete. |
| Open Datasets | Yes | Datasets. For the classification task, we compare MITRA on three established 10-fold benchmarks: TabRepo [30]; TabZilla [25]; and AutoML benchmarks [11]. We additionally evaluate on a concurrent benchmark TabArena [8] in Appendix C.4. For the regression task, we compare on the 10-fold TabRepo [30] benchmark. ... The dataset task IDs are provided as follows: TabRepo: 2, 11, 37, ... AMLB: 2073, 146818, ... TabZilla: 4, 9, 10, ... TabRepo Reg: 167210, 359930, ... |
| Dataset Splits | Yes | Datasets. For the classification task, we compare MITRA on three established 10-fold benchmarks: TabRepo [30]; TabZilla [25]; and AutoML benchmarks [11]. ... To compare with 1D models, e.g., TabPFN that support features up to 100, and 2D models, e.g., TabPFNv2 that support features up to 500, we evaluate on both small-feature and large-feature benchmarks. Algorithm 1 MITRA: Mixture of Synthetic Priors for pretraining TFMs ... 4: Randomly partition $D^{(i)}$ into support set $D^{(i)}_{\text{sup}}$ and query set $D^{(i)}_{\text{qry}}$ with $\lvert D^{(i)}_{\text{sup}} \rvert = s$ and $\lvert D^{(i)}_{\text{qry}} \rvert = q$. |
| Hardware Specification | Yes | For pretraining MITRA, we use eight 40GB A100 GPUs. |
| Software Dependencies | No | Our implementation is based on PyTorch. ... For these indirectly sampled TBPs, we use the classifiers and regressors from scikit-learn [27]. |
| Experiment Setup | Yes | MITRA is built on Transformer architecture [33] with 12 layers, 512 embedding size and 4 attention heads. Each Transformer layer includes both row-wise attention and column-wise attention implemented using FlashAttention [4]. The resulting model contains 72M parameters. MITRA 1D is built on Transformer architecture, and each layer contains row-wise attention. The resulting model contains 37M parameters. For pretraining MITRA, we use eight 40GB A100 GPUs. MITRA is trained on 45 million synthetically generated datasets. This training takes approximately 60 hours on 8 GPUs (Nvidia A100s). To normalize features, we apply uniform quantile transform based on the support set, followed by standard normalization based on the mean and standard deviation from the support set. For regression tasks, we additionally apply min-max normalization for the target column using the minimum and maximum values of the support set for each table. For models incorporating ensembling, we use the default number of estimators for each model, i.e., 4 for TabPFNv2 on classification tasks, 8 for TabPFNv2 on regression tasks, 32 for TabICL, 3 for TabPFN. ... All models with +f are fine-tuned for 50 epochs, which is a setting that typically triggers early stopping on most datasets. |
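
The support/query partition quoted under Dataset Splits and the normalization quoted under Experiment Setup can be illustrated with a minimal sketch. This is not the authors' code, which is unreleased according to the Open Source Code row. All function names here are hypothetical. The empirical-CDF quantile transform is a simple stand-in for whatever implementation the paper uses, and the sketch assumes the mean and standard deviation are taken over the transformed support values.

```python
import random
from bisect import bisect_right
from statistics import mean, pstdev


def split_support_query(rows, s, q, seed=0):
    """Randomly partition rows into a support set of size s and a query set of size q."""
    if s + q > len(rows):
        raise ValueError("s + q exceeds the number of rows")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:s], shuffled[s:s + q]


def fit_feature_scaler(support_col):
    """Fit on the support column only; return a function that normalizes one value.

    Step 1: uniform quantile transform via the support set's empirical CDF
            (assumed stand-in for the paper's transform).
    Step 2: standardize with the mean/std of the transformed support values.
    """
    ordered = sorted(support_col)
    n = len(ordered)

    def to_uniform(x):
        # Fraction of support values less than or equal to x, in [0, 1].
        return bisect_right(ordered, x) / n

    u_support = [to_uniform(x) for x in support_col]
    mu = mean(u_support)
    sigma = pstdev(u_support) or 1.0  # guard against constant columns

    return lambda x: (to_uniform(x) - mu) / sigma


def fit_target_scaler(support_targets):
    """Min-max scaling of a regression target using support-set min and max."""
    lo, hi = min(support_targets), max(support_targets)
    span = (hi - lo) or 1.0  # guard against constant targets
    return lambda y: (y - lo) / span


# Toy table with one feature x and target y = 2x (illustrative data only).
rows = [(float(i), 2.0 * i) for i in range(10)]
support, query = split_support_query(rows, s=6, q=4)

scale_x = fit_feature_scaler([x for x, _ in support])
scale_y = fit_target_scaler([y for _, y in support])

# Query rows are transformed with statistics fitted on the support set only.
query_x = [scale_x(x) for x, _ in query]
support_y = [scale_y(y) for _, y in support]
```

Both scalers are fitted on the support set and only applied to the query set, so no query statistics leak into the preprocessing. Query targets can fall outside [0, 1] after min-max scaling, since the bounds come from the support set alone.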