Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Evaluating and Learning Optimal Dynamic Treatment Regimes under Truncation by Death

Authors: Sihyung Park, Wenbin Lu, Shu Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical validation and an application to electronic health records showcase its utility for personalized treatment optimization. [...] Section 5 demonstrates multiply robust off-policy learning using the proposed estimator. Section 6 demonstrates multiple robustness to nuisance model misspecification. We show MR estimator can facilitate decision-making for high-risk patients group by applying it to MIMIC-III database (Section 7).
Researcher Affiliation	Academia	Sihyung Park Department of Statistics North Carolina State University Raleigh, NC 27695 EMAIL Wenbin Lu Department of Statistics North Carolina State University Raleigh, NC 27695 EMAIL Shu Yang Department of Statistics North Carolina State University Raleigh, NC 27695 EMAIL
Pseudocode	Yes	Algorithm 1 Compute VMR(π) via cross-fitting.
Open Source Code	Yes	Justification: We have made the code used to generate our results available. The zip file contains a YAML file that can reproduce the same conda environment we used.
Open Datasets	Yes	To illustrate the utility of our proposed methodology, we applied it to the Medical Information Mart for Intensive Care III (MIMIC-III) v1.4 database. MIMIC-III is a publicly accessible, MIT-licensed database containing de-identified health records from over 40,000 patients admitted to critical care units at Beth Israel Deaconess Medical Center between 2001 and 2012. [...] Johnson et al. (2016) provides a detailed description.
Dataset Splits	Yes	In each iteration, stratified sampling on censoring (C1, C2) and survival (S1, S2) indicators was used to create balanced training and test sets. Policies were learned on training data and their value estimated on test data. [...] We used empirical version of this formula with true nuisance models and an independently generated large sample of size 100,000 to compute PCD-AS.
Hardware Specification	Yes	The off-policy learning simulation ran on an internal cluster, with each iteration on a single core, 8 GB RAM instance. Other experiments and the MIMIC-III application used a CPU machine with 16 GB RAM.
Software Dependencies	Yes	Justification: We have made the code used to generate our results available. The zip file contains a YAML file that can reproduce the same conda environment we used.
Experiment Setup	Yes	We employed logistic regression models for estimating the propensity score, censoring and survival probability. For continuous outcome models, we fitted random forest regressors. Lastly, generalized additive models were fitted to estimate the conditional mean functions, mp2 and mµ2. We used a differential evolution algorithm to optimize within the class of linear policies.
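
Illustrative sketch (not the authors' code): the Dataset Splits row describes stratified sampling on the censoring (C1, C2) and survival (S1, S2) indicators to create balanced training and test sets. One way such a split might be done is shown below, using only the Python standard library. The function name `stratified_split`, the 50/50 test fraction, the sample size of 1,000, and the randomly generated indicator tuples are all assumptions made for illustration; the quoted excerpt does not state the split ratio or how the sampling was implemented.

```python
import random
from collections import Counter, defaultdict


def stratified_split(strata, test_fraction=0.5, seed=0):
    """Split row indices into train/test sets, stratum by stratum.

    `strata` holds one hashable label per row, for example a
    (C1, C2, S1, S2) tuple of binary indicators. Each stratum is
    shuffled and divided separately, so both sets keep roughly the
    same mix of strata. (Hypothetical helper, not from the paper.)
    """
    rng = random.Random(seed)

    # Group row indices by their stratum label.
    groups = defaultdict(list)
    for idx, label in enumerate(strata):
        groups[label].append(idx)

    train, test = [], []
    for label in sorted(groups):  # fixed order keeps the split reproducible
        members = groups[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Toy data (assumption): 1,000 rows of random binary indicators standing in
# for (C1, C2, S1, S2). Real indicators would come from the study data.
data_rng = random.Random(42)
indicators = [tuple(data_rng.randint(0, 1) for _ in range(4)) for _ in range(1000)]

train_idx, test_idx = stratified_split(indicators, test_fraction=0.5, seed=1)

# Per-stratum counts in each set, to check that the split is balanced.
train_counts = Counter(indicators[i] for i in train_idx)
test_counts = Counter(indicators[i] for i in test_idx)
```

With a test fraction of 0.5, each stratum's train and test counts differ by at most one row, which is the sense in which the two sets are balanced here. Changing `seed` on each iteration would give a fresh split per repetition.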