Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Do-PFN: In-Context Learning for Causal Effect Estimation

Authors: Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, Bernhard Schölkopf

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive experiments in synthetic and semi-synthetic settings, we show that our approach allows for the accurate estimation of causal effects without knowledge of the underlying causal graph.
Researcher Affiliation	Collaboration	1Prior Labs, Freiburg, Germany 2ELLIS Institute Tübingen, Tübingen, Germany 3University of Freiburg, Freiburg, Germany 4Max Planck Institute for Intelligent Systems, Tübingen, Germany 5University of Cambridge, Cambridge, United Kingdom
Pseudocode	Yes	Algorithm 1: Prior-fitting with SGD. Do-PFN is pre-trained on pairs of synthetic observational and interventional datasets; the model is trained to predict interventional outcomes $y^{in}$ given a covariate vector $x^{in}$, the value of an intervention $t^{in}$, and an observational dataset $D^{ob}$.
Open Source Code	Yes	We provide our pre-trained model, pre-training data generating code, and case study datasets at https://github.com/jr2021/Do-PFN.
Open Datasets	Yes	We evaluate the performance of Do-PFN on six case studies across more than 1,000 synthetic datasets, the popular Real Cause benchmark (Neal et al., 2020), as well as two observational datasets with widely agreed upon causal graphs. We provide our pre-trained model, pre-training data generating code, and case study datasets at https://github.com/jr2021/Do-PFN.
Dataset Splits	No	The paper does not explicitly provide training/test/validation dataset splits. It mentions generating synthetic datasets and using benchmarks, but no specific splits within those are described.
Hardware Specification	Yes	We primarily evaluate two versions of Do-PFN, v1 and v1.1, which are pretrained for 48 hours and 96 hours respectively on a single RTX 2080.
Software Dependencies	No	We use PyTorch (Paszke, 2019) to implement all our experiments. Our implementation of the causal prior is based on the Causal Playground library (Sauter et al., 2024) and the codebase used for TabPFN (Hollmann et al., 2023, 2025). We use Matplotlib (Hunter, 2007), Autorank (Herbold, 2020) and Seaborn (Waskom, 2021) for our plots. The software components are cited by their original papers, but the specific versions used in this implementation are not given, and version numbers are required for reproducibility.
Experiment Setup	Yes	Do-PFN has 7.3 million parameters and is trained with Algorithm 1, with details in Appendix C. We primarily evaluate two versions of Do-PFN, v1 and v1.1, which are pretrained for 48 hours and 96 hours respectively on a single RTX 2080. In practice, we perform mini-batch stochastic gradient descent using the Adam optimizer (Kingma and Ba, 2014).
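
The Pseudocode and Experiment Setup rows describe prior-fitting: sample a synthetic causal model, draw an observational dataset and an interventional example from it, and update the predictor by mini-batch gradient descent. The following is a minimal sketch of that loop under loud assumptions, not the authors' implementation. The three-variable linear prior, the five-feature linear predictor standing in for the 7.3M-parameter model, the squared-error loss, plain SGD in place of Adam, and every function name are hypothetical and chosen only to keep the example dependency-free.

```python
import math
import random


def sample_scm(rng):
    # Hypothetical toy prior: confounder z -> t, z -> y, and t -> y, all linear.
    return {"a": rng.uniform(-1, 1), "b": rng.uniform(-1, 1), "c": rng.uniform(-1, 1)}


def sample_row(scm, rng, intervene):
    z = rng.gauss(0, 1)
    # Under do(t), the treatment is set independently of the confounder z.
    t = rng.gauss(0, 1) if intervene else scm["a"] * z + rng.gauss(0, 1)
    y = scm["b"] * t + scm["c"] * z + rng.gauss(0, 0.1)
    return z, t, y


def slope(xs, ys):
    # Least-squares slope of ys regressed on xs.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / var if var > 0 else 0.0


def features(d_ob, x_in, t_in):
    # Toy "in-context" conditioning: summarise the observational dataset
    # by two regression slopes and combine them with the query (x_in, t_in).
    zs, ts, ys = zip(*d_ob)
    return [t_in * slope(ts, ys), x_in * slope(zs, ys), t_in, x_in, 1.0]


def prior_fit(steps=200, batch_size=8, n_obs=50, lr=0.01, seed=0):
    rng = random.Random(seed)
    w = [0.0] * 5  # weights of the stand-in linear predictor
    for _ in range(steps):
        grad = [0.0] * 5
        for _ in range(batch_size):
            scm = sample_scm(rng)  # fresh causal model per example
            d_ob = [sample_row(scm, rng, intervene=False) for _ in range(n_obs)]
            x_in, t_in, y_in = sample_row(scm, rng, intervene=True)
            phi = features(d_ob, x_in, t_in)
            err = sum(wi * fi for wi, fi in zip(w, phi)) - y_in
            for i, fi in enumerate(phi):
                grad[i] += 2 * err * fi / batch_size  # squared-error gradient
        w = [wi - lr * gi for wi, gi in zip(w, grad)]  # plain SGD step
    return w


weights = prior_fit()
print([round(v, 3) for v in weights])
```

The structural point is that every training example comes from a newly sampled causal model, so the predictor has to read the causal relationship off the observational dataset rather than memorise a single data-generating process.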