Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Finding Regions of Heterogeneity in Decision-Making via Expected Conditional Covariance

Authors: Justin Lim, Christina X Ji, Michael Oberst, Saul Blecker, Leora Horwitz, David Sontag

NeurIPS 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In a semi-synthetic experiment, we show that our algorithm recovers the correct region of heterogeneity accurately compared to baselines. Finally, we apply our algorithm to real-world healthcare datasets, recovering variation that aligns with existing clinical knowledge.
Researcher Affiliation	Collaboration	Justin Lim MIT CSAIL and IMES Cambridge, MA EMAIL Christina X Ji MIT CSAIL and IMES Cambridge, MA EMAIL Michael Oberst MIT CSAIL and IMES Cambridge, MA EMAIL Saul Blecker NYU Langone New York, NY EMAIL Leora Horwitz NYU Langone New York, NY EMAIL David Sontag MIT CSAIL and IMES Cambridge, MA EMAIL. This work was supported in part by Independence Blue Cross
Pseudocode	Yes	Algorithm 1 Identifying regions with variation
Open Source Code	Yes	Our code is available at https://github.com/clinicalml/finding-decision-heterogeneity-regions.
Open Datasets	Yes	Dataset: We use publicly available data from Lin et al. (2020), who ask participants on Amazon's Mechanical Turk platform to make recidivism predictions based on information present in the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) dataset for Broward County, FL (Dressel and Farid, 2018).
Dataset Splits	Yes	After requiring at least 4 patients per agent, 3,576 patients and 176 group practices are included. This filter ensures each group practice has at least 1 sample in the training and validation sets and at least 2 samples in the test set.
Hardware Specification	No	The paper mentions 'CPU with 32 cores, 256GB of RAM' in Appendix B.1 and 'All experiments ran on CPUs with 32 cores and 256GB of RAM' in the main text of the appendix, but it does not specify exact CPU models (e.g., Intel Xeon E5-2630, AMD Ryzen) or GPU models, if any were used.
Software Dependencies	No	The paper mentions using PyTorch and Scikit-Learn but does not provide specific version numbers for these or any other software libraries or dependencies used in the experiments.
Experiment Setup	Yes	In this experiment, we choose β = 0.25 as input to our algorithm. Adam optimizer with learning rate 0.001.
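
The filter quoted in the Dataset Splits row (at least 4 samples per agent, so that each agent has at least 1 training, 1 validation, and 2 test samples) can be illustrated with a short sketch. This is not the authors' code: the function name `filter_and_split`, the `(agent_id, sample_id)` record layout, the seed, and the 60/20/20 allocation of any samples beyond the first four are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def filter_and_split(records, min_per_agent=4, seed=0):
    """Hypothetical helper: drop agents with too few samples, then split.

    records: list of (agent_id, sample_id) pairs.
    Returns (train, valid, test), each a list of (agent_id, sample_id).
    """
    by_agent = defaultdict(list)
    for agent, sample in records:
        by_agent[agent].append(sample)

    rng = random.Random(seed)
    train, valid, test = [], [], []
    for agent, samples in sorted(by_agent.items()):
        if len(samples) < min_per_agent:
            continue  # agent cannot satisfy the 1/1/2 minimum, so exclude it
        samples = list(samples)
        rng.shuffle(samples)
        # Reserve the guaranteed minimum for each split first.
        train.append((agent, samples[0]))
        valid.append((agent, samples[1]))
        test.extend((agent, s) for s in samples[2:4])
        # Allocate the remainder; the 60/20/20 proportions are an assumption.
        for s in samples[4:]:
            r = rng.random()
            target = train if r < 0.6 else valid if r < 0.8 else test
            target.append((agent, s))
    return train, valid, test
```

With this layout, an agent with exactly four samples contributes one training, one validation, and two test samples, which is why four is the smallest count that satisfies the stated guarantee.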