Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Do-PFN: In-Context Learning for Causal Effect Estimation
Authors: Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, Bernhard Schölkopf
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments in synthetic and semi-synthetic settings, we show that our approach allows for the accurate estimation of causal effects without knowledge of the underlying causal graph. |
| Researcher Affiliation | Collaboration | 1Prior Labs, Freiburg, Germany 2ELLIS Institute Tübingen, Tübingen, Germany 3University of Freiburg, Freiburg, Germany 4Max Planck Institute for Intelligent Systems, Tübingen, Germany 5University of Cambridge, Cambridge, United Kingdom |
| Pseudocode | Yes | Algorithm 1: Prior-fitting with SGD. Do-PFN is pre-trained on pairs of synthetic observational and interventional datasets; the model is trained to predict interventional outcomes $y^{in}$ given a covariate vector $x^{in}$, the value of an intervention $t^{in}$, and an observational dataset $\mathcal{D}^{ob}$. |
| Open Source Code | Yes | We provide our pre-trained model, pre-training data generating code, and case study datasets at https://github.com/jr2021/Do-PFN. |
| Open Datasets | Yes | We evaluate the performance of Do-PFN on six case studies across more than 1,000 synthetic datasets, the popular RealCause benchmark (Neal et al., 2020), as well as two observational datasets with widely agreed upon causal graphs. We provide our pre-trained model, pre-training data generating code, and case study datasets at https://github.com/jr2021/Do-PFN. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It mentions generating synthetic datasets and using benchmarks, but no specific splits within those are described. |
| Hardware Specification | Yes | We primarily evaluate two versions of Do-PFN, v1 and v1.1, which are pretrained for 48 hours and 96 hours respectively on a single RTX 2080. |
| Software Dependencies | No | We use PyTorch (Paszke, 2019) to implement all our experiments. Our implementation of the causal prior is based on the Causal Playground library (Sauter et al., 2024) and the codebase used for TabPFN (Hollmann et al., 2023, 2025). We use Matplotlib (Hunter, 2007), Autorank (Herbold, 2020) and Seaborn (Waskom, 2021) for our plots. The software components are mentioned with citations to their original papers, but the specific version numbers of the software *used in this implementation* are not provided, and these are required for reproducibility. |
| Experiment Setup | Yes | Do-PFN has 7.3 million parameters and is trained with Algorithm 1, with details in Appendix C. We primarily evaluate two versions of Do-PFN, v1 and v1.1, which are pretrained for 48 hours and 96 hours respectively on a single RTX 2080. In practice, we perform mini-batch stochastic gradient descent using the Adam optimizer (Kingma and Ba, 2014). |
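
The Pseudocode row above describes pre-training on pairs of synthetic observational and interventional datasets. As a rough illustration of what one such pair could look like, the sketch below samples a toy linear-Gaussian structural causal model and draws both datasets from it. This is not the paper's prior or its data-generating code: the three-variable graph (covariate x affects treatment t and outcome y; t affects y), the coefficient ranges, the noise scales, the binary intervention values, and every function name are assumptions made for illustration only.

```python
import random


def sample_scm(rng):
    """Draw coefficients for a toy linear SCM x -> t, x -> y, t -> y (assumed, not the paper's prior)."""
    return {
        "x_to_t": rng.uniform(-2.0, 2.0),
        "x_to_y": rng.uniform(-2.0, 2.0),
        "t_to_y": rng.uniform(-2.0, 2.0),
    }


def sample_observational(scm, n, rng):
    """Sample (x, t, y) rows in which the treatment t depends on the covariate x."""
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        t = scm["x_to_t"] * x + rng.gauss(0.0, 1.0)
        y = scm["x_to_y"] * x + scm["t_to_y"] * t + rng.gauss(0.0, 0.1)
        rows.append((x, t, y))
    return rows


def sample_interventional(scm, n, rng):
    """Sample (x, t, y) rows under do(T = t): t is set without looking at x."""
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        t = rng.choice([0.0, 1.0])  # intervention value, independent of x
        y = scm["x_to_y"] * x + scm["t_to_y"] * t + rng.gauss(0.0, 0.1)
        rows.append((x, t, y))
    return rows


def sample_training_pair(n_ob, n_in, seed=None):
    """Return one SCM together with its observational and interventional datasets."""
    rng = random.Random(seed)
    scm = sample_scm(rng)
    d_ob = sample_observational(scm, n_ob, rng)
    d_in = sample_interventional(scm, n_in, rng)
    return scm, d_ob, d_in


if __name__ == "__main__":
    scm, d_ob, d_in = sample_training_pair(n_ob=50, n_in=20, seed=0)
    print(len(d_ob), len(d_in))
```

In a prior-fitting loop of the kind the table describes, a model would presumably receive the observational rows as context and be optimized (for example with Adam in PyTorch) to predict each interventional y from its (x, t) pair, with a fresh SCM sampled per step. The sketch covers only the data-pairing step and leaves the model and loss out.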