Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

An Investigation of Memorization Risk in Healthcare Foundation Models

Authors: Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour, Walter Gerych, Marzyeh Ghassemi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this work, we introduce a suite of black-box evaluation tests to assess privacy-related memorization risks in foundation models trained on structured EHR data. Our framework includes methods for probing memorization at both the embedding and generative levels, and aims to distinguish between model generalization and harmful memorization in clinically relevant settings. We contextualize memorization in terms of its potential to compromise patient privacy, particularly for vulnerable subgroups. We validate our approach on a publicly available EHR foundation model and release an open-source toolkit to facilitate reproducible and collaborative privacy assessments in healthcare AI.
Researcher Affiliation	Academia	Sana Tonekaboni (MIT; Broad Institute of MIT and Harvard; Vector Institute); Lena Stempfle (MIT; Chalmers University of Technology; University of Gothenburg); Adibvafa Fallahpour (University of Toronto; Vector Institute; University Health Network (UHN)); Walter Gerych (Worcester Polytechnic Institute, Computer Science Department); Marzyeh Ghassemi (MIT)
Pseudocode	No	The paper describes methods and tests but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	and release an open-source toolkit to facilitate reproducible and collaborative privacy assessments in healthcare AI. Code available at https://github.com/sanatonek/EHR-FM_memorization
Open Datasets	Yes	Also, EHRMamba2 is trained on the public MIMIC-IV dataset [23], which enables direct testing of memorization on known training samples, unlike other released models that are trained on private and inaccessible datasets.
Dataset Splits	Yes	We train the probing model with embeddings of a separate test cohort as well as varying fractions of training data. High accuracy with minimal data suggests strong memorization rather than generalization. Table 6 in the appendix reports probing attack performance on sensitive diagnoses, measured across different prompt lengths (10, 20, 50 tokens) and training data fractions. Extracting sensitive information only from the embeddings is difficult, even if an adversary can access a portion of the training data. AUROC values across all sensitive attributes remain around 0.5, indicating no clear memorization signal and suggesting random performance. Notably, AUPRC and F1 generally decline with increased training data, hinting at inconsistent memorization patterns. These trends may be partially influenced by dataset size variations: 113,579 for the test set, 102,222 for 0.1%, and 90,864 for 20%.
Hardware Specification	No	The compute resources required to perform our tests depend on the inference cost of the EHR-FM, as the method relies on these models to generate sequences for evaluation.
Software Dependencies	No	The paper mentions specific models like Med-BERT and EHRMamba2 but does not provide details on software dependencies such as libraries or frameworks with their version numbers.
Experiment Setup	Yes	Analysis: We compare the average distance (measured by d_EMD) of generated codes to the true trajectories for different prompt setups (Random, Static, 10, 20, and 50 codes), on our benchmark model. Figure 2a shows the distance over 100 prediction codes (|s| = 100) for 3K individuals in the pretraining cohort. Following the strategy of language models [9], for every prompt, hundreds of trajectories are sampled, and the distribution is used to quantify memorization.
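
The Experiment Setup excerpt compares sampled trajectories to the true one using an earth mover's distance. The sketch below illustrates that idea only. It is not the paper's implementation: the excerpt does not define the ground metric behind d_EMD, so the code assumes, purely for illustration, that codes are integer indices on a line with unit distance between neighbours. The function names (emd_1d, histogram, mean_distance_to_truth) are hypothetical.

```python
from collections import Counter


def emd_1d(p, q):
    """Earth mover's distance between two histograms over the same ordered bins.

    Assumption: adjacent bins are one unit apart, so the EMD equals the sum of
    absolute differences between the two normalised cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    carry, total = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        carry += p_i / total_p - q_i / total_q  # mass still to be moved right
        total += abs(carry)
    return total


def histogram(codes, vocab_size):
    """Count how often each code index in range(vocab_size) occurs."""
    counts = Counter(codes)
    return [counts.get(i, 0) for i in range(vocab_size)]


def mean_distance_to_truth(sampled, truth, vocab_size):
    """Average EMD between each sampled trajectory and the true trajectory."""
    reference = histogram(truth, vocab_size)
    distances = [emd_1d(histogram(s, vocab_size), reference) for s in sampled]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    truth = [2, 2, 2]
    sampled = [[2, 2, 2], [0, 0, 0]]  # one exact match, one far-off sample
    print(mean_distance_to_truth(sampled, truth, vocab_size=3))
```

Under these assumptions, a lower average distance for samples drawn from a longer prompt would mean the generated codes track the true trajectory more closely. Whether that reflects memorization or generalization is the question the paper's tests are designed to separate.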