Understanding Learned Models by Identifying Important Features at the Right Resolution

Authors: Kyubin Lee, Akshay Sood, Mark Craven (pp. 4155-4163)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach by analyzing random forest and LSTM neural network models learned in two challenging biomedical applications.
Researcher Affiliation | Academia | Kyubin Lee: Clinical Genomics Analysis Branch, National Cancer Center, Republic of Korea. Akshay Sood: Dept. of Computer Sciences and Dept. of Biostatistics & Medical Informatics, University of Wisconsin-Madison. Mark Craven: Dept. of Biostatistics & Medical Informatics and Dept. of Computer Sciences, University of Wisconsin-Madison.
Pseudocode | Yes | Algorithm 1: General approach to identifying important features via perturbation. (A hedged sketch of this style of perturbation analysis appears after the table.)
Open Source Code | Yes | The source code for our methods is available at https://github.com/Craven-Biostat-Lab/mihifepe.
Open Datasets | No | The paper describes the datasets used (HSV-1 data and EHR data from the University of Wisconsin Health System) but does not provide concrete access information (link, DOI, or specific citation for public access) for them.
Dataset Splits | Yes | Using 10-fold cross-validation to assess the predictive accuracy of the networks results in an area under the ROC curve (AUROC) of 0.757. (A sketch of this evaluation protocol appears after the table.)
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed machine specifications) used for running the experiments are provided.
Software Dependencies | No | The paper mentions software such as Med2Vec and model types (random forest, LSTM), but does not provide specific version numbers for any of the software dependencies used.
Experiment Setup | Yes | Our LSTM networks have a cell state of size 100 and a sigmoid output layer. The coded diagnoses, problem diagnoses, and interventions (procedures and medications) all comprise large vocabularies (6,533 for coded diagnoses, 4,398 for problem diagnoses, and 8,745 for interventions) of which only a small subset is recorded at each encounter. Therefore, we first map event vectors for each of these sets to an embedded space using Med2Vec (Choi et al. 2016), resulting in shorter, dense fixed-length vectors. Separate embeddings of size 200 were generated for each of these sets, which were then concatenated, along with the other temporal features, to produce the event representation at each timestamp in the record. (An architecture sketch appears after the table.)
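As a companion to the Pseudocode row (Algorithm 1: identifying important features via perturbation), below is a minimal sketch of one perturbation-based importance scheme: permute a feature and measure the resulting drop in AUROC. The synthetic data, the choice of permutation as the perturbation, and the AUROC-drop scoring are illustrative assumptions; the authors' actual method is implemented in the mihifepe repository linked above.

```python
# Hedged sketch of perturbation-based feature importance, in the spirit of
# Algorithm 1. Permuting a column is an assumed perturbation choice; the
# paper's own procedure lives in the mihifepe package.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def perturbation_importance(model, X, y, seed=0):
    """Score each feature by the drop in AUROC after permuting it."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    scores = {}
    for j in range(X.shape[1]):
        X_perturbed = X.copy()
        X_perturbed[:, j] = rng.permutation(X_perturbed[:, j])  # break feature-label link
        perturbed = roc_auc_score(y, model.predict_proba(X_perturbed)[:, 1])
        scores[j] = baseline - perturbed  # larger drop => more important feature
    return scores


X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X, y)
print(perturbation_importance(forest, X, y))
```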
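The Dataset Splits row quotes a 10-fold cross-validated AUROC of 0.757. A minimal sketch of that evaluation protocol, assuming scikit-learn and a synthetic stand-in for the non-public EHR data:

```python
# Sketch of 10-fold cross-validated AUROC estimation; the dataset and
# classifier are stand-ins, since the paper's EHR data are not public.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aurocs = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="roc_auc")
print(f"mean AUROC over 10 folds: {aurocs.mean():.3f}")
```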
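The Experiment Setup row describes an LSTM with a cell state of size 100, a sigmoid output layer, and per-timestamp inputs formed by concatenating three Med2Vec embeddings of size 200 with other temporal features. The PyTorch sketch below mirrors those stated dimensions; the size of the "other temporal features" vector, the class and argument names, and the use of precomputed embeddings are assumptions rather than details taken from the paper.

```python
# Hedged PyTorch sketch of the described architecture: three embedding sets of
# size 200 are concatenated with other temporal features at each timestamp and
# fed to an LSTM (hidden/cell size 100) with a sigmoid output layer.
import torch
import torch.nn as nn


class EncounterLSTM(nn.Module):
    def __init__(self, embed_dim=200, n_embedded_sets=3, other_dim=10, hidden_dim=100):
        super().__init__()
        input_dim = n_embedded_sets * embed_dim + other_dim  # 3 * 200 + other features (other_dim assumed)
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, 1)  # followed by a sigmoid

    def forward(self, embedded_events, other_features):
        # embedded_events: (batch, time, 3 * 200) precomputed Med2Vec-style embeddings
        # other_features:  (batch, time, other_dim) remaining temporal features
        x = torch.cat([embedded_events, other_features], dim=-1)
        _, (h_n, _) = self.lstm(x)               # final hidden state summarizes the record
        return torch.sigmoid(self.output(h_n[-1]))


model = EncounterLSTM()
probs = model(torch.randn(4, 12, 600), torch.randn(4, 12, 10))
print(probs.shape)  # torch.Size([4, 1])
```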