Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Conformal Prediction for Causal Effects of Continuous Treatments

Authors: Maresa Schröder, Dennis Frauen, Jonas Schweisthal, Konstantin Hess, Valentyn Melnychuk, Stefan Feuerriegel

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate the effectiveness of the conformal prediction intervals in experiments on synthetic and real-world datasets. To the best of our knowledge, we are the first to propose conformal prediction for continuous treatments when the propensity score is unknown and must be estimated from data. 5 Experiments Baselines: As we have discussed above, there are no baselines that directly compute prediction intervals with finite-sample guarantees for potential outcomes of continuous treatments. Therefore, we compare our method against MC dropout [14] and deep ensemble methods [33]. Performance metrics: We evaluate the methods in terms of whether the prediction intervals are faithful [e.g., 21]. That is, we compute whether the empirical coverage of the prediction intervals surpasses the threshold of 1 − α for different significance levels α ∈ {0.05, 0.1, 0.2}. Additionally, we report the width of the resulting intervals in Supplement H. 5.1 Datasets Synthetic datasets: We follow common practice and evaluate our methods using synthetic datasets [e.g., 2, 25]. Medical dataset: We demonstrate the applicability of our CP method to medical datasets on the MIMIC-III dataset [26].
Researcher Affiliation	Academia	Maresa Schröder LMU Munich Munich Center for Machine Learning EMAIL Dennis Frauen LMU Munich Munich Center for Machine Learning EMAIL Jonas Schweisthal LMU Munich Munich Center for Machine Learning EMAIL Konstantin Hess LMU Munich Munich Center for Machine Learning EMAIL Valentyn Melnychuk LMU Munich Munich Center for Machine Learning EMAIL Stefan Feuerriegel LMU Munich Munich Center for Machine Learning EMAIL
Pseudocode	Yes	We now use Thm. 4.5 to present an algorithm for computing CP intervals of potential outcomes from continuous treatment variables under unknown propensities in Alg. 1. We present a similar algorithm for scenario 1 with known propensities and discuss the computational complexity in below. Algorithm 1: Algorithm for computing CP intervals of potential outcomes of continuous interventions under unknown propensities. Algorithm 2: Algorithm for computing CP intervals of potential outcomes of continuous interventions under known propensities.
Open Source Code	Yes	Code and data are available at our public GitHub repository: https://github.com/m-schroder/ContinuousCausalCP
Open Datasets	Yes	Synthetic datasets: We follow common practice and evaluate our methods using synthetic datasets [e.g., 2, 25]. Due to the fundamental problem of causal inference, counterfactual outcomes are never observable in real-world datasets. Synthetic datasets enable us to access counterfactual outcomes and thus to benchmark methods in terms of whether the computed intervals are faithful. Additionally, we perform experiments on the semi-synthetic TCGA dataset in Supplement C. We hereby show the applicability of our method to high-dimensional real-world data in a controlled environment. Medical dataset: We demonstrate the applicability of our CP method to medical datasets on the MIMIC-III dataset [26]. MIMIC-III contains de-identified health records from patients admitted to critical care units at large hospitals. Our goal is to predict patient outcomes in terms of blood pressure when treated with a different duration of mechanical ventilation. We use 8 confounders from medical practice (e.g., respiratory rate, hematocrit). Overall, we consider 14,719 patients, split into train (60%), validation (10%), calibration (20%), and test (10%) sets. Details are in Supplement G. [26] A. E. W. Johnson, T. J. Pollard, L. Shen, L.-W. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9, 2016.
Dataset Splits	Yes	Throughout our work, we split the dataset into a proper training dataset $D_T = (X_i, A_i, Y_i)_{i=1,\dots,m}$, and a calibration dataset $D_C = (X_i, A_i, Y_i)_{i=m+1,\dots,n}$. Overall, we consider 14,719 patients, split into train (60%), validation (10%), calibration (20%), and test (10%) sets. Details are in Supplement G.
Hardware Specification	Yes	All experiments were run on an AMD Ryzen 7 PRO 6850U 2.70 GHz CPU with eight cores and 32GB RAM.
Software Dependencies	No	Our experiments are implemented in PyTorch Lightning. We provide our code in our GitHub repository. All experiments were run on an AMD Ryzen 7 PRO 6850U 2.70 GHz CPU with eight cores and 32GB RAM. We limited the experiments to standard multi-layer perceptron (MLP) regression models, consisting of three layers of width 16 with ReLU activation function and MC dropout at a rate of 0.1, optimized via Adam. We did not perform hyperparameter optimization, as our method aimed to provide an agnostic prediction interval applicable to any prediction model. All models were trained for 300 epochs with batch size 32. Our algorithm requires solving (non-convex) optimization problems through mathematical optimization. We chose to employ two interior-point solvers in our experiments: For the experiments with soft interventions that pose convex optimization problems, we use the solver MOSEK. For the hard interventions, which included non-convex problems, we used the solver IPOPT. Both solvers were run with default parameters.
Experiment Setup	Yes	Implementation: All methods are implemented with ϕ as a multi-layer perceptron (MLP) and an MC dropout regularization of rate 0.1. Crucially, we use the identical MLP for both our CP method and MC dropout. Hence, all performance gains must be attributed to the coverage guarantees of our conformal method. In the MC dropout baseline, the uncertainty intervals are computed via Monte Carlo sampling. In scenario 2, we perform conditional density estimation by conditional normalizing flows [49]. Implementation and training details are in Supplement G. We limited the experiments to standard multi-layer perceptron (MLP) regression models, consisting of three layers of width 16 with ReLU activation function and MC dropout at a rate of 0.1, optimized via Adam. We did not perform hyperparameter optimization, as our method aimed to provide an agnostic prediction interval applicable to any prediction model. All models were trained for 300 epochs with batch size 32.
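
Illustrative sketch: The split proportions (60/10/20/10 over 14,719 patients) and the faithfulness criterion (empirical coverage of at least 1 − α) quoted in the table above can be made concrete with a short example. This is a sketch only and not the authors' code: the function names, the random seed, the rule that the last split absorbs the rounding remainder, and the toy intervals are all assumptions made for illustration.

```python
import random


def split_indices(n, fractions, seed=0):
    """Shuffle range(n) and cut it into consecutive parts sized by `fractions`.

    Assumption: the last part absorbs any rounding remainder; the paper does
    not state how fractional sizes are rounded.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    sizes = [int(n * f) for f in fractions]
    sizes[-1] = n - sum(sizes[:-1])
    parts, start = [], 0
    for size in sizes:
        parts.append(idx[start:start + size])
        start += size
    return parts


def empirical_coverage(lower, upper, outcomes):
    """Fraction of outcomes that fall inside their interval [lower, upper]."""
    hits = sum(lo <= y <= hi for lo, hi, y in zip(lower, upper, outcomes))
    return hits / len(outcomes)


def is_faithful(coverage, alpha):
    """An interval method is faithful at level alpha if coverage >= 1 - alpha."""
    return coverage >= 1 - alpha


# Train / validation / calibration / test proportions quoted in the table.
train, val, calib, test = split_indices(14719, [0.6, 0.1, 0.2, 0.1])

# Toy intervals and outcomes (made up for illustration, not results).
toy_lower = [0.0, 0.0, 0.0, 0.0]
toy_upper = [1.0, 1.0, 1.0, 1.0]
toy_outcomes = [0.5, 2.0, 0.1, 1.0]
toy_coverage = empirical_coverage(toy_lower, toy_upper, toy_outcomes)

for alpha in (0.05, 0.1, 0.2):
    print(alpha, is_faithful(toy_coverage, alpha))
```

In the toy data, three of the four outcomes lie inside their intervals, so the coverage falls short of every threshold 1 − α listed in the table; a faithful method would be expected to meet or exceed each threshold on held-out test data.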