Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ConTextTab: A Semantics-Aware Tabular In-Context Learner

Authors: Marco Spinaci, Marek Polewczyk, Maximilian Schambach, Sam Thelin

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We use a variety of tabular prediction datasets to evaluate and compare our approach to established baselines and other SOTA methods. Namely, we evaluate all models on the following benchmarks: OpenML-CC18 [1], a pure classification benchmark; OpenML-CTR23 [11], a pure regression benchmark; TALENT [40], a recently introduced diverse benchmark containing over 300 classification and regression benchmarks. Here, we focus on a subset containing 45 datasets that are representative of the overall performance of the baselines investigated in the original works, which we refer to as the TALENT-Tiny benchmark; TabReD [32], a small but challenging benchmark of large datasets representative of practical prediction tasks; and finally CARTE [23], a mixed classification and regression benchmark containing highly semantic features and few numerical ones.
Researcher Affiliation	Industry	Marco Spinaci¹, Marek Polewczyk², Maximilian Schambach², Sam Thelin²; ¹SAP France, ²SAP SE; {firstname.lastname}@sap.com
Pseudocode	No	The paper describes the model architecture and methods in prose and uses mathematical equations, but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	Code and model checkpoints are available at: https://github.com/SAP-samples/contexttab. The inference code and the trained weights will be released after acceptance to ensure double blind review. Pretraining code might be released, but it might be too tightly linked with the data platform to be usable.
Open Datasets	Yes	For pretraining, we use the T4 dataset [12]. We use a variety of tabular prediction datasets to evaluate and compare our approach to established baselines and other SOTA methods. Namely, we evaluate all models on the following benchmarks: OpenML-CC18 [1], a pure classification benchmark; OpenML-CTR23 [11], a pure regression benchmark; TALENT [40], a recently introduced diverse benchmark containing over 300 classification and regression benchmarks...; TabReD [32]...; and finally CARTE [23]...
Dataset Splits	Yes	We extracted all datasets from their original source and performed a custom stratified train-validation-test split with a 70-10-20 ratio. For classification tasks, the target column is used for stratification. For regression tasks, we perform stratification on the binned target column, binning it into 5 quantiles using the qcut method from the pandas library.
Hardware Specification	Yes	Under this setup, we train a base sized model on a single H100 GPU, reaching a throughput of roughly 10 tables/s. The experiments were conducted on a compute node with 40 CPU cores, 320 GB of RAM and an H100 GPU with 96 GB of VRAM. CatBoost, LightGBM and XGBoost are evaluated on CPU machines with up to 256 GB of RAM, whereas RealMLP and TabM are evaluated on H100 GPUs with 96 GB of VRAM.
Software Dependencies	Yes	We use the model from the official Python tabpfn package with version 2.1.0 together with the tabpfn-extensions package version 0.1.0... We use the latest model weights tabicl-classifier-v1.1-0506.ckpt from the recent 0.1.2 version of the official tabicl package. We use the model provided in the official Python carte-ai package with version 0.0.26. Throughout, evaluation is performed using scikit-learn v1.5.2. Throughout, we use AutoGluon v1.2 and its TabularPredictor without custom preprocessing. We use the official implementation from the mambular PyPI package, version 1.5.1.
Experiment Setup	Yes	We train each model for between 4 and 10 million steps (i.e., 2 to 5 epochs) until convergence. We use a micro batch size of 1 and accumulate gradients to simulate a batch size of 256 (or 128 for smaller models of mini size). To improve stability, we employ gradient clipping and the AdamW optimizer with a maximum learning rate of 10^-4, reached after a linear warm-up phase of 1000 gradient updates.
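
The Dataset Splits entry above describes a stratified 70-10-20 train-validation-test split, with regression targets binned into 5 quantiles via pandas qcut. The following is a minimal sketch of one way such a split could be produced, not the authors' code. The function name `stratified_70_10_20`, the `task` and `seed` parameters, the two-stage use of scikit-learn's `train_test_split`, and the `duplicates="drop"` setting are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_70_10_20(df, target, task, seed=0):
    """Split df into train/validation/test parts with a 70-10-20 ratio.

    Hypothetical helper: assumes df has a unique index and that every
    stratum contains enough rows to appear in all three parts.
    """
    if task == "classification":
        # Classification: stratify directly on the target column.
        strata = df[target]
    else:
        # Regression: stratify on the target binned into 5 quantiles.
        # duplicates="drop" is an assumption to tolerate heavily tied targets.
        strata = pd.qcut(df[target], q=5, labels=False, duplicates="drop")

    n = len(df)
    n_test = round(0.2 * n)  # 20% test
    n_val = round(0.1 * n)   # 10% validation; the remainder is train

    # Stage 1: hold out the test part.
    rest, test = train_test_split(
        df, test_size=n_test, stratify=strata, random_state=seed
    )
    # Stage 2: carve the validation part out of what remains.
    train, val = train_test_split(
        rest, test_size=n_val, stratify=strata.loc[rest.index], random_state=seed
    )
    return train, val, test
```

Passing integer sizes rather than fractions keeps the part sizes predictable: for a 1000-row table the sketch returns 700, 100 and 200 rows. The random seed used in the paper is not stated in the excerpt, so `seed=0` is a placeholder.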
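
The Experiment Setup entry above combines gradient accumulation (micro batch size 1, simulated batch size 256 or 128) with a linear warm-up to a maximum learning rate of 10^-4 over 1000 gradient updates. The arithmetic can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the authors' training code: all function and constant names are hypothetical, and holding the learning rate constant after warm-up is an assumption, since the excerpt does not describe the schedule beyond the warm-up phase. Gradient clipping is omitted because no threshold is given.

```python
MAX_LR = 1e-4          # maximum learning rate from the setup
WARMUP_UPDATES = 1000  # linear warm-up length, in gradient updates
MICRO_BATCH = 1        # tables per forward/backward pass
EFFECTIVE_BATCH = 256  # 128 for the smaller "mini" models


def accumulation_steps(effective_batch=EFFECTIVE_BATCH, micro_batch=MICRO_BATCH):
    """Number of micro batches accumulated before one optimizer update."""
    if effective_batch % micro_batch != 0:
        raise ValueError("effective batch must be a multiple of the micro batch")
    return effective_batch // micro_batch


def learning_rate(update, max_lr=MAX_LR, warmup=WARMUP_UPDATES):
    """Learning rate for the 1-based gradient update index.

    Rises linearly to max_lr over `warmup` updates.
    ASSUMPTION: stays at max_lr afterwards (not stated in the source).
    """
    if update < 1:
        raise ValueError("update index is 1-based")
    return max_lr * min(1.0, update / warmup)


def is_update_step(micro_step, accum):
    """True when the 1-based micro step completes an accumulation window."""
    return micro_step % accum == 0
```

With a micro batch size of 1, `accumulation_steps()` gives 256 forward/backward passes per optimizer update, and `learning_rate(500)` sits halfway up the warm-up ramp.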