Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

RCT Rejection Sampling for Causal Estimation Evaluation

Authors: Katherine A. Keith, Sergey Feldman, David Jurgens, Jonathan Bragg, Rohit Bhattacharya

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using synthetic data, we show our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite data considerations with a novel, real-world RCT which we release publicly, consisting of approximately 70k observations and text data as high-dimensional covariates.
Researcher Affiliation | Collaboration | Katherine A. Keith (Williams College); Sergey Feldman (Allen Institute for Artificial Intelligence); David Jurgens (University of Michigan); Jonathan Bragg (Allen Institute for Artificial Intelligence); Rohit Bhattacharya (Williams College)
Pseudocode | Yes | Algorithm 1: RCT rejection sampling
Open Source Code | Yes | We also release our code; code and data at https://github.com/kakeith/rct_rejection_sampling.
Open Datasets | Yes | As a proof of concept, we implement an example evaluation pipeline and walk through these finite data considerations with a novel, real-world RCT which we release publicly, consisting of approximately 70k observations and text data as high-dimensional covariates (§4.1.1). Code and data at https://github.com/kakeith/rct_rejection_sampling.
Dataset Splits | Yes | We fit our models using cross-fitting (Newey & Robins, 2018), also called sample-splitting (Hansen, 2000), together with cross-validation; see Appendix F for more details. We divide the data into K folds. For each inference fold j, the other K−1 folds (shorthand −j) are used as the training set to fit the base learners, e.g., Q̂^{−j}_{T0} or ĝ^{−j}, where the superscript indicates the data the model is fit on. The single hyperparameter for logistic regression is selected via cross-validation, where the training set is again split into folds.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions several software packages (scikit-learn, CatBoost, EconML) but does not provide version numbers for any of them. For example, it states: "Using scikit-learn Pedregosa et al. (2011)" and "Using CatBoost (Dorogush et al., 2018) with default parameters".
Experiment Setup | Yes | As a proof of concept, we apply baseline causal estimation models to the resulting D_OBS datasets after RCT rejection sampling (with many random seeds), as mentioned above. We implement 13 commonly-used causal estimation methods via two steps: (1) fitting base learners and (2) using causal estimators that combine the base learners via plug-in principles or second-stage regression. CatBoost (Dorogush et al., 2018) is used with default parameters and without cross-validation. Logistic regression uses scikit-learn (Pedregosa et al., 2011) with an elastic-net penalty, L1 ratio 0.1, balanced class weights, and the SAGA solver; the regularization parameter C is tuned via cross-validation over the set {1e−4, 1e−3, 1e−2, 1e−1, 1e0, 1e1}.
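The Pseudocode row references the paper's Algorithm 1 (RCT rejection sampling). As a rough illustration of the general idea only — subsampling an RCT with covariate- and treatment-dependent acceptance probabilities so that the retained subsample is confounded — a generic rejection sampler might look like the sketch below. The acceptance rule here is a placeholder assumption, not the paper's exact Algorithm 1:

```python
import numpy as np

def rejection_subsample(C, T, accept_prob, rng):
    """Keep RCT unit i with probability accept_prob(C_i, T_i).

    Making acceptance depend on both covariates C and treatment T induces
    a T-C dependence (confounding) in the retained subsample. accept_prob
    is a placeholder; the paper's Algorithm 1 specifies the exact rule.
    """
    p = np.clip(accept_prob(C, T), 0.0, 1.0)
    return rng.random(len(T)) < p  # boolean mask of retained units

# Tiny synthetic example (illustrative only).
rng = np.random.default_rng(0)
C = rng.random(10_000)                # one scalar covariate per unit
T = rng.integers(0, 2, size=10_000)   # randomized treatment, P(T=1) = 0.5
# Placeholder rule: treated units with high C are kept more often,
# control units with low C are kept more often.
keep = rejection_subsample(C, T, lambda c, t: np.where(t == 1, c, 1 - c), rng)
```

In the retained subsample, treated units now have systematically higher C than control units, which is the kind of induced confounding the evaluation design relies on.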
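The Dataset Splits row describes cross-fitting: predictions for each inference fold j come from base learners fit only on the other K−1 folds. A minimal scikit-learn sketch of that pattern for a propensity-score base learner (a hypothetical helper, not the authors' code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cross_fit_propensity(X, t, n_folds=5, seed=0):
    """Cross-fitting: the propensity scores for fold j are produced by a
    model trained only on the other K-1 folds (the -j training set)."""
    g_hat = np.empty(len(t))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, infer_idx in kf.split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], t[train_idx])
        g_hat[infer_idx] = model.predict_proba(X[infer_idx])[:, 1]
    return g_hat

# Synthetic data: treatment depends on the first covariate.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
t = (X[:, 0] + rng.normal(size=500) > 0).astype(int)
g = cross_fit_propensity(X, t)
```

Per the paper, the logistic-regression hyperparameter is then chosen by a further cross-validation split inside each training set; that inner loop is omitted here for brevity.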
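The Experiment Setup row specifies the logistic-regression base learner precisely: elastic-net penalty, L1 ratio 0.1, balanced class weights, SAGA solver, and C tuned by cross-validation over {1e−4, …, 1e1}. That configuration can be expressed in scikit-learn as follows; the number of CV folds and the synthetic data are assumptions, not taken from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Base learner as described in the paper's setup.
base = LogisticRegression(
    penalty="elasticnet",
    l1_ratio=0.1,
    class_weight="balanced",
    solver="saga",
    max_iter=5000,
)
# Regularization grid quoted in the paper.
grid = {"C": [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]}
search = GridSearchCV(base, grid, cv=3)  # fold count is an assumption

# Fit on small synthetic data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)
search.fit(X, y)
```

Note that `penalty="elasticnet"` with an `l1_ratio` requires the SAGA solver in scikit-learn, which is consistent with the paper's stated choice.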