Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Multi-Accurate CATE is Robust to Unknown Covariate Shifts

Authors: Christoph Kern, Michael P. Kim, Angela Zhou

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a thorough empirical study comparing finite- and large-sample performance of multi-accurate learning and other causal machine learning techniques more specifically tailored for causal structure. In Section 5, we provide extensive empirical comparisons in simulated data and a case study of the Women's Health Initiative parallel clinical trial and observational study.
Researcher Affiliation | Academia | Christoph Kern (EMAIL), Department of Statistics, Ludwig-Maximilians-University of Munich, Munich Center for Machine Learning (MCML); Michael Kim (EMAIL), Department of Computer Science, Cornell University; Angela Zhou (EMAIL), Department of Data Sciences and Operations, University of Southern California
Pseudocode | Yes | Algorithm 1: Multi-accuracy for CATE estimation for Setting 1, unknown covariate shifts; Algorithm 2: Multi-accuracy for CATE estimation for calibrating CATE on small Randomized Controlled Trial data; Algorithm 3: Multi-accurate DR-learner (Equation (8)) for unknown covariate shift; Algorithm 4: MCBoost
Open Source Code | Yes | We provide code of the simulation studies and the real data application for replication purposes in the following public OSF repository: https://osf.io/zxjvw/?view_only=a622c123414e4be6a218f121ded191d3
Open Datasets | Yes | We next present a case study that draws on data from the Women's Health Initiative (WHI) studies (Machens and Schmidt-Gollwitzer, 2003).
Dataset Splits | Yes | The size of the (audit/RCT) data used for multi-calibration boosting (500 observations) and the (test) data used for model evaluation (5000 observations) is fixed. We vary the shift intensity s ∈ {0, 0.25, ..., 2} and the training set size ∈ {500, 2000, 3500, 5000}, and run experiments for each combination 25 times. We start with the observational study (OS) (52,335 observations) and draw a random 50% sample that serves as observational training data for (naive) CATE estimation. We split the clinical trial data (14,531 observations) into an initial 50% training set and a 50% test set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It mentions R packages for implementation but no hardware.
Software Dependencies | Yes | Data preparations, model training and evaluation are conducted in R (3.6.3) (R Core Team, 2020) using the packages ranger (0.13.1) (Wright and Ziegler, 2017), grf (2.0.2) (Tibshirani et al., 2021) and rlearner (1.1.0) (Nie and Wager, 2020). The simulation studies heavily draw on the causal experiment simulator of the causalToolbox (0.0.2.000) (Künzel et al., 2019) package. In all experiments, (initial) T-learner and DR-learner are post-processed using the MCBoost algorithm as implemented in the mcboost (0.4.2) (Pfisterer et al., 2021) package.
Experiment Setup | Yes | Table 2: Hyperparameter settings for post-processing using MCBoost; Table 3: Hyperparameter settings of (baseline) CATE learners; Table 13: Hyperparameter settings for post-processing using MCBoost; Table 14: Hyperparameter settings of (baseline) CATE learners. These tables specify hyperparameters such as max_iter, alpha, eta, num.trees, mtry, sample.fraction, honesty.fraction, min.node.size and maxdepth.
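The post-processing step shared by the algorithms listed under Pseudocode (implemented in the paper via the R package mcboost) is a generic audit-and-update loop: repeatedly find an auditor function that detects a bias in the current predictions on held-out audit data, then nudge the predictions to remove it. The following is a minimal Python sketch of that loop; the auditor collection, step size `eta`, iteration cap, and stopping tolerance are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def multiaccuracy_boost(preds, X, y, auditors, eta=0.1, max_iter=10, tol=1e-3):
    """MCBoost-style multi-accuracy post-processing (sketch).

    preds    : initial predictions on the audit set (e.g., CATE estimates)
    auditors : callables h(X) returning values in [-1, 1]; each defines a
               subpopulation/direction the predictor must be accurate on
    Each round finds the auditor most correlated with the residual and
    shifts predictions against that residual until no auditor detects a
    correlation above tol.
    """
    preds = np.asarray(preds, dtype=float).copy()
    for _ in range(max_iter):
        resid = y - preds
        # audit: correlation of each auditor with the current residual
        corrs = [np.mean(h(X) * resid) for h in auditors]
        j = int(np.argmax(np.abs(corrs)))
        if abs(corrs[j]) < tol:  # multi-accurate within tol: stop early
            break
        # update: move predictions toward y in the flagged direction
        preds += eta * np.sign(corrs[j]) * auditors[j](X)
    return preds
```

After the loop, the residual correlation with every auditor is driven toward zero (up to a term of order `eta`), which is the multi-accuracy guarantee the paper relies on for robustness to unknown covariate shifts.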