Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction
Authors: Jue Hou, Zijian Guo, Tianxi Cai
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present an extensive simulation study to demonstrate the superiority of our approach compared to existing supervised methods. We apply the method to genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort. |
| Researcher Affiliation | Academia | Jue Hou EMAIL Division of Biostatistics University of Minnesota School of Public Health Minneapolis, MN 55455, USA Zijian Guo EMAIL Department of Statistics Rutgers University Piscataway, NJ 08854-8019, USA Tianxi Cai EMAIL Department of Biostatistics Harvard T.H. Chan School of Public Health Boston, MA 02115, USA |
| Pseudocode | No | The paper describes methods and procedures in narrative text and mathematical formulations but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is being released or provide a link to a code repository. The license mentioned refers to the paper itself, not the code ('License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1075.html.'). |
| Open Datasets | No | We applied the proposed SAS method to the risk prediction of Type II Diabetes Mellitus (T2DM) using EHR and genomic data of participants of the Mass General Brigham Biobank study. The paper mentions using data from a specific study (Mass General Brigham Biobank study) but does not provide concrete access information (e.g., URL, DOI, or specific citation to a publicly accessible repository) for this dataset. |
| Dataset Splits | Yes | To compare the performance of different risk prediction models, we use 10-fold cross-validation to estimate the out-of-sample AUC. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments. It mentions running 'simulations' and 'applying the method to genetic risk prediction' but without any hardware specifications. |
| Software Dependencies | No | The paper does not explicitly mention any software dependencies with specific version numbers. |
| Experiment Setup | Yes | Throughout, we let p = 500, q = 100, N = 20000 and consider n = 500. For each configuration, we summarize the results based on 500 simulated datasets. We let K = 5 in cross-fitting and use 5-fold cross-validation for tuning parameter selection. To compare the performance of different risk prediction models, we use 10-fold cross-validation to estimate the out-of-sample AUC. We repeated the process 10 times and took the average of predicted probabilities across the repeats for each labelled sample and method in comparison. |
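
The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows (10-fold cross-validation, repeated 10 times, with out-of-fold predicted probabilities averaged per labelled sample before computing AUC) can be illustrated with a minimal sketch. This is not the authors' code, which the paper does not release. The helper names (`auc`, `kfold_indices`, `repeated_cv_scores`), the toy one-feature model `toy_fit_predict`, and the synthetic data are all assumptions made for illustration; the paper's SAS estimator and the biobank data are not reproduced here.

```python
import math
import random
from statistics import mean


def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv_scores(X, y, fit_predict, k=10, repeats=10, seed=0):
    """Average each sample's out-of-fold predicted probability over repeated k-fold CV."""
    rng = random.Random(seed)
    n = len(y)
    totals = [0.0] * n
    for _ in range(repeats):
        for test_idx in kfold_indices(n, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(n) if i not in held_out]
            preds = fit_predict(
                [X[i] for i in train_idx],
                [y[i] for i in train_idx],
                [X[i] for i in test_idx],
            )
            # Each sample is held out exactly once per repeat.
            for i, p in zip(test_idx, preds):
                totals[i] += p
    return [t / repeats for t in totals]


def toy_fit_predict(x_train, y_train, x_test):
    """Hypothetical stand-in model: logistic curve centred between the two class means."""
    mu1 = mean(x for x, y in zip(x_train, y_train) if y == 1)
    mu0 = mean(x for x, y in zip(x_train, y_train) if y == 0)
    mid, slope = (mu1 + mu0) / 2, mu1 - mu0
    return [1 / (1 + math.exp(-slope * (x - mid))) for x in x_test]


# Synthetic labelled sample (assumption): one feature whose mean shifts with the label.
data_rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [data_rng.gauss(label, 1.0) for label in y]

avg_probs = repeated_cv_scores(X, y, toy_fit_predict, k=10, repeats=10)
cv_auc = auc(y, avg_probs)
print(f"Out-of-sample AUC from averaged predictions: {cv_auc:.3f}")
```

Averaging the predicted probabilities across repeats before computing a single AUC, rather than averaging ten separate AUC values, matches the wording of the quoted setup; any model with the same `fit_predict(train_X, train_y, test_X)` shape can be swapped in for the toy one.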