Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models

Authors: Julius Vetter, Manuel Gloeckler, Daniel Gedon, Jakob H Macke

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To assess NPE-PFN, we conduct experiments on synthetic SBI benchmark tasks and real data, covering scenarios from low to high-dimensional data and including cases with model misspecification. We evaluate NPE-PFN on various tasks from the SBI benchmark [27], which provides ground truth posterior samples for 10 observations for each task. We measure posterior sample quality using the classifier two-sample test (C2ST, 53).
Researcher Affiliation	Academia	(1) Machine Learning in Science, University of Tübingen, Tübingen, Germany; (2) Tübingen AI Center, Tübingen, Germany; (3) Department Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany; {firstname.lastname}@uni-tuebingen.de
Pseudocode	Yes	pseudocode for NPE-PFN in Appendix Alg. 1. ... pseudocode for TSNPE-PFN in Appendix Alg. 2.
Open Source Code	Yes	Code available at https://github.com/mackelab/npe-pfn. ... Code to use NPE-PFN and reproduce the results is available at https://github.com/mackelab/npe-pfn.
Open Datasets	Yes	We evaluate NPE-PFN on various tasks from the SBI benchmark [27] ... We infer posteriors for 10 real observations from the Allen cell type database [61] ... we apply TabPFN on some classical unconditional density estimation benchmark tasks from the UCI repository [56].
Dataset Splits	Yes	Training was stopped early based on the validation loss, as evaluated on a held-out set containing 10% of the available simulations. ... We equally divide the simulation budget into 10 rounds ... we use 10^3, 10^4, or (if applicable) 10^5 samples for training.
Hardware Specification	Yes	We use a mix of Nvidia 2080 Ti, A100, and H100 GPUs to obtain the results related to NPE-PFN. ... SBI baselines were run on 8 CPU cores ... All runtimes for NPE-PFN (Fig. 2b) were obtained using an Nvidia A100 GPU, where possible. For the unfiltered variant of NPE-PFN, an H100 GPU was used for the large context containing 10^5 simulations.
Software Dependencies	No	The paper mentions PyTorch [86], the TabPFN library [38], the SBI library [54], Hydra [87], and the Adam optimizer [90], and refers to specific flow types (neural spline flow [42], masked autoregressive flow [41]). However, it provides no version numbers for these software components, only citations to the corresponding papers.
Experiment Setup	Yes	Training was performed using the Adam optimizer [90] with a batch size of 200 and a learning rate of 5 × 10^-4. Training was stopped early based on the validation loss... In all experiments, we use the default version of the TabPFN classifier or regressor for (TS)NPE-PFN, with no changes to hyperparameters such as the softmax temperature.
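
The quoted Dataset Splits and Experiment Setup entries describe a 10% held-out validation set, an equal division of the simulation budget into 10 rounds, and early stopping on the validation loss. The following is a minimal sketch of that bookkeeping in plain Python, not the authors' code. The function names, the seed, and the `patience` parameter are hypothetical; the quoted text gives no early-stopping patience. Only the batch size (200), the learning rate (5 × 10^-4), the 10% fraction, and the 10 rounds come from the table above.

```python
import random

# Values quoted in the table above.
BATCH_SIZE = 200
LEARNING_RATE = 5e-4
VAL_FRACTION = 0.1
N_ROUNDS = 10


def split_simulations(n_simulations, val_fraction=VAL_FRACTION, seed=0):
    """Shuffle simulation indices and hold out a validation fraction.

    The seed is an assumption made for this sketch.
    """
    indices = list(range(n_simulations))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_simulations * val_fraction))
    return indices[n_val:], indices[:n_val]


def round_budgets(total_budget, n_rounds=N_ROUNDS):
    """Divide a simulation budget as equally as possible across rounds."""
    base, remainder = divmod(total_budget, n_rounds)
    return [base + (1 if i < remainder else 0) for i in range(n_rounds)]


def best_epoch_with_early_stopping(val_losses, patience):
    """Return the epoch with the lowest validation loss seen before stopping.

    Training stops once `patience` epochs pass without improvement.
    `patience` is a hypothetical parameter, not a value from the paper.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


if __name__ == "__main__":
    train_idx, val_idx = split_simulations(10**3)
    print(len(train_idx), len(val_idx))
    print(round_budgets(10**4))
    print(best_epoch_with_early_stopping([1.0, 0.8, 0.9, 0.95, 0.7], patience=2))
```

With a budget of 10^3 simulations, the split yields 900 training and 100 validation indices, and each of the 10 rounds receives 100 simulations.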