Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CausalPFN: Amortized Causal Effect Estimation via In-Context Learning

Authors: Vahid Balazadeh Meresht, Hamidreza Kamkari, Valentin Thomas, Junwei Ma, Bingru Li, Jesse C. Cresswell, Rahul G Krishnan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our approach achieves superior average performance on heterogeneous and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC). Moreover, it shows competitive performance for real-world policy making on uplift modeling tasks. ... Table 1: CATE & ATE results. Columns correspond to benchmark suites: IHDP, ACIC 2016, Lalonde CPS/PSID.
Researcher Affiliation	Collaboration	(1) University of Toronto; (2) Vector Institute; (3) Layer 6 AI
Pseudocode	Yes	Algorithm 1: Parallel training of CausalPFN. Require: prior π, DGPs and CEPO values P^ψ_obs, µ_t(·; ψ), model q_θ, DGP batch size B_t, query batch size B_q, fixed feature length F, and histogram loss HL (10) [44]. 1: while not converged do 2: Sample ψ[1], …, ψ[B_t] ∼ π 3: Sample D_obs[i] ∼ P^{ψ[i]}_obs, 1 ≤ i ≤ B_t 4: Randomly sample query treatments t(i,j) for 1 ≤ i ≤ B_t, 1 ≤ j ≤ B_q 5: Sample query covariates x(i,j) ∼ P^{ψ[i]}_obs for 1 ≤ i ≤ B_t, 1 ≤ j ≤ B_q 6: Set µ(i,j) ← µ_{t(i,j)}(x(i,j); ψ[i]) 7: Pad x(i,j) with zeros such that x(i,j) ∈ ℝ^F 8: L̂ ← (1 / (B_t B_q)) Σ_{i,j} HL[µ(i,j) ‖ q_θ(· | x(i,j), t(i,j), D_obs[i])] 9: Update θ using the gradients ∇_θ L̂ 10: end while
Open Source Code	Yes	CausalPFN is fast, ready-to-use, and does not require any further training or hyperparameter tuning. (https://github.com/vdblm/CausalPFN). ... Finally, we release our model's weights with a user-friendly API, streamlining the adoption of CausalPFN as a capable estimator. ... We publish CausalPFN as a standalone PyPI package (https://pypi.org/project/causalpfn/), along with the instructions to reproduce all the results in the paper.
Open Datasets	Yes	Table 1 compares CausalPFN to all baselines on four standard sets of datasets: 100 realizations of IHDP [94, 40], 10 realizations of ACIC 2016 [27], and the Lalonde CPS and Lalonde PSID cohorts [58] with their causal effects provided by RealCause (each with 100 realizations) [81]. ... We benchmark CausalPFN on five large marketing RCTs from the scikit-uplift library [74]. The first dataset, Hillstrom [41]... We also evaluate CausalPFN on four larger campaigns: Lenta, RetailHero (X5), Megafon (Mega), and Criteo [61, 95, 78, 122].
Dataset Splits	Yes	To build Qini curves we follow scikit-uplift's recommended five-fold stratified split based on the outcome and the treatment [74]. In each fold, we hold out 20% of the data as test rows and train the baseline models on the remaining 80%.
Hardware Specification	Yes	We use a single A100 GPU for 7 days to train the base CausalPFN. We also use an H100 GPU for 2 days to produce the TabPFN fine-tuned results in Table 3. Apart from that, all of the other experiments are run on either a desktop RTX6000, A100, or an H100. ... All inference, including baselines, performed on an 80 GB H100 GPU, 32 CPUs, and 256 GB RAM.
Software Dependencies	No	The paper mentions several software components and libraries, such as AdamW [55], the FLAML (AutoML) library [113], EconML [11], CATENets [19], and scikit-uplift [74]. However, specific version numbers for these components are not provided in the paper.
Experiment Setup	Yes	The full model has approximately 20M parameters and is trained in two stages: (i) a predictive phase that mimics standard predictive PFN training from Ma et al. [69], and (ii) a causal phase that optimizes the CEPO loss. We use AdamW [55] with warmup and cosine annealing for the predictive phase, and switch to the schedule-free optimizer [24] in the causal phase. The model is trained with a maximum context length of 16K in the first phase and 2,048 in the second. We use four A100 GPUs trained for at most one week for the initial phase, and two days on an H100 for the second phase. ... We discretize the outcome axis into L = 1024 bins and let the network project the query tokens into L logits. We then apply SoftMax to turn the logits into a quantized distribution q_θ(· | x, t, D_obs)[ℓ], ℓ ∈ [L]. ... For the tuned results in Table 1, we performed hyperparameter tuning using the FLAML (AutoML) library [113] on both the propensity and outcome models with (i) Time budget of 900 seconds, (ii) K-fold cross-validation with K = 3, (iii) Early stopping, and (iv) base estimators ["lgbm", "xgboost", "xgb_limitdepth", "rf", "kneighbor", "extra_tree", "lrl1", "lrl2"].
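
The Dataset Splits row describes a five-fold split stratified on both the outcome and the treatment, with 20% of rows held out per fold. The sketch below illustrates that idea in plain Python and is not the scikit-uplift or CausalPFN code. It assumes discrete (for example binary) treatment and outcome values; the function names and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(treatment, outcome, n_folds=5, seed=0):
    """Assign every row to a fold, stratifying on the joint (treatment, outcome) label."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for idx, (t, y) in enumerate(zip(treatment, outcome)):
        strata[(t, y)].append(idx)

    fold_of = [None] * len(treatment)
    for key in sorted(strata):
        rows = strata[key]
        rng.shuffle(rows)
        # Deal the shuffled rows of this stratum round-robin across the folds.
        for pos, idx in enumerate(rows):
            fold_of[idx] = pos % n_folds
    return fold_of


def train_test_indices(fold_of, test_fold):
    """Return (train, test) row indices when `test_fold` is held out."""
    test = [i for i, f in enumerate(fold_of) if f == test_fold]
    train = [i for i, f in enumerate(fold_of) if f != test_fold]
    return train, test


# Toy data (hypothetical): 200 rows, four equally sized (treatment, outcome) strata.
treatment = [i % 2 for i in range(200)]
outcome = [(i // 2) % 2 for i in range(200)]

fold_of = stratified_folds(treatment, outcome)
train, test = train_test_indices(fold_of, test_fold=0)
```

On this toy data each held-out fold contains 40 of the 200 rows, matching the 80/20 split quoted above. A continuous outcome would need to be binned before it could be used as a stratification label.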
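
The Experiment Setup row states that the outcome axis is discretized into L = 1024 bins and that a softmax turns the network's L logits into a quantized distribution. The following sketch shows only that last step under stated assumptions: uniformly spaced bins over a fixed range and the distribution's mean as a point estimate are illustrative choices, not details confirmed by the quoted text, and the helper names are hypothetical.

```python
import math

L = 1024  # number of outcome bins, as quoted in the Experiment Setup row


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(y_min, y_max, n_bins=L):
    """Centers of n_bins equal-width bins on [y_min, y_max] (assumed layout)."""
    width = (y_max - y_min) / n_bins
    return [y_min + (k + 0.5) * width for k in range(n_bins)]


def expected_outcome(logits, centers):
    """Mean of the quantized distribution (an assumed way to read off a point estimate)."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))


# Hypothetical example: all-zero logits give a uniform distribution over the bins.
centers = bin_centers(-1.0, 1.0)
logits = [0.0] * L
probs = softmax(logits)
mean = expected_outcome(logits, centers)
```

With all-zero logits every bin receives probability 1/1024, so the mean falls at the midpoint of the assumed range.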
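
The FLAML tuning settings listed in the Experiment Setup row can be written down as keyword arguments. The mapping below onto the parameter names of `flaml.AutoML.fit` (`time_budget`, `eval_method`, `n_splits`, `early_stop`, `estimator_list`) is an assumption about how the quoted settings would be expressed; the paper excerpt does not show the authors' actual call, and the task type for each model is not stated.

```python
# Settings quoted in the Experiment Setup row, expressed as keyword arguments
# for flaml.AutoML.fit. The parameter-name mapping is an assumption.
flaml_settings = {
    "time_budget": 900,    # (i) time budget of 900 seconds
    "eval_method": "cv",   # (ii) K-fold cross-validation ...
    "n_splits": 3,         #      ... with K = 3
    "early_stop": True,    # (iii) early stopping
    "estimator_list": [    # (iv) base estimators, copied verbatim
        "lgbm", "xgboost", "xgb_limitdepth", "rf",
        "kneighbor", "extra_tree", "lrl1", "lrl2",
    ],
}

# Hypothetical usage (not executed here; requires flaml and a dataset X, y):
#   from flaml import AutoML
#   automl = AutoML()
#   automl.fit(X_train=X, y_train=y, task="classification", **flaml_settings)
```

Keeping the settings in one dictionary makes it straightforward to reuse the same budget and estimator list for both the propensity and the outcome model, as the row describes.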