Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
An Investigation of Memorization Risk in Healthcare Foundation Models
Authors: Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour, Walter Gerych, Marzyeh Ghassemi
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we introduce a suite of black-box evaluation tests to assess privacy-related memorization risks in foundation models trained on structured EHR data. Our framework includes methods for probing memorization at both the embedding and generative levels, and aims to distinguish between model generalization and harmful memorization in clinically relevant settings. We contextualize memorization in terms of its potential to compromise patient privacy, particularly for vulnerable subgroups. We validate our approach on a publicly available EHR foundation model and release an open-source toolkit to facilitate reproducible and collaborative privacy assessments in healthcare AI. |
| Researcher Affiliation | Academia | Sana Tonekaboni (MIT; Broad Institute of MIT and Harvard; Vector Institute); Lena Stempfle (MIT; Chalmers University of Technology; University of Gothenburg); Adibvafa Fallahpour (University of Toronto; Vector Institute; University Health Network (UHN)); Walter Gerych (Worcester Polytechnic Institute, Computer Science Department); Marzyeh Ghassemi (MIT) |
| Pseudocode | No | The paper describes methods and tests but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | and release an open-source toolkit to facilitate reproducible and collaborative privacy assessments in healthcare AI. Code available at https://github.com/sanatonek/EHR-FM_memorization |
| Open Datasets | Yes | Also, EHRMamba2 is trained on the public MIMIC-IV dataset [23], which enables direct testing of memorization on known training samples, unlike other released models that are trained on private and inaccessible datasets. |
| Dataset Splits | Yes | We train the probing model with embeddings of a separate test cohort as well as varying fractions of training data. High accuracy with minimal data suggests strong memorization rather than generalization. Table 6 in the appendix reports probing attack performance on sensitive diagnoses, measured across different prompt lengths (10, 20, 50 tokens) and training. Extracting sensitive information only from the embeddings is difficult, even if an adversary can access a portion of the training data. AUROC values across all sensitive attributes remain around 0.5, indicating no clear memorization signal and suggesting random performance. Notably, AUPRC and F1 generally decline with increased training data, hinting at inconsistent memorization patterns. These trends may be partially influenced by dataset size variations: 113,579 for the test set, 102,222 for 0.1%, and 90,864 for 20%. |
| Hardware Specification | No | The compute resources required to perform our tests depend on the inference cost of the EHR-FM, as the method relies on these models to generate sequences for evaluation. |
| Software Dependencies | No | The paper mentions specific models like Med BERT and EHRMamba2 but does not provide details on software dependencies such as libraries or frameworks with their version numbers. |
| Experiment Setup | Yes | Analysis: We compare the average distance (measured by d_EMD) of generated codes to the true trajectories for different prompt setups (Random, Static, 10, 20, and 50 codes), on our benchmark model. Figure 2a shows the distance over 100 prediction codes (\|s\| = 100) for 3K individuals in the pretraining cohort. Following the strategy of language models [9], for every prompt, hundreds of trajectories are sampled, and the distribution is used to quantify memorization. |
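
The Dataset Splits row quotes AUROC values "around 0.5" as evidence of random performance. As a reading aid, the sketch below computes AUROC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. This is a minimal illustration and not the paper's evaluation code; the function name `auroc` and the toy labels and scores are hypothetical and do not come from the paper or its toolkit.

```python
def auroc(labels, scores):
    """Pairwise AUROC: chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as 0.5.
    Illustrative helper only (hypothetical, not from the paper's toolkit).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration): 1 = sensitive attribute present.
labels = [1, 1, 0, 0]
informative = [0.9, 0.8, 0.2, 0.1]    # probe separates the classes perfectly
uninformative = [0.5, 0.5, 0.5, 0.5]  # probe output carries no signal

print(auroc(labels, informative))    # 1.0
print(auroc(labels, uninformative))  # 0.5
```

A probe whose scores carry no information about the sensitive attribute lands at 0.5 under this definition, which is why values near 0.5 are read as the absence of a clear memorization signal. The quadratic loop is chosen for clarity; a rank-based computation would be preferable for cohorts of the size quoted in the table.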