Towards Continuous Scientific Data Analysis and Hypothesis Evolution
Authors: Yolanda Gil, Daniel Garijo, Varun Ratnakar, Rajiv Mayani, Ravali Adusumilli, Hunter Boyce, Arunima Srivastava, Parag Mallick
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implemented our approach in the DISK framework, and evaluated it using two scenarios from cancer multi-omics: 1) data for new patients becomes available over time, 2) new types of data for the same patients are released. We show that in all scenarios DISK updates the confidence on the original hypotheses as it automatically analyzes new data. |
| Researcher Affiliation | Academia | 1 Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey CA, 90292, USA. {gil, dgarijo, varunr, mayani}@isi.edu 2 Stanford School of Medicine, Canary Center for Early Cancer Detection, Stanford University, 1265 Welch Road, Stanford CA 94305, USA. {ravali, hboyce, arus, paragm}@stanford.edu |
| Pseudocode | No | The paper describes the system architecture and workflows but does not present any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | We have implemented our approach in the DISK framework [Ratnakar 2016]. ... Ratnakar, V. DISK software (v1.0.0). Zenodo. 2016. http://doi.org/10.5281/zenodo.168079 |
| Open Datasets | Yes | We use 84 datasets of genomic and proteomic data from 42 different patient samples [Adusumilli 2016]. ... Adusumilli, R. Datasets used in [Gil et al 2016] for AAAI 2017. Zenodo. 2016. http://doi.org/10.5281/zenodo.180716. ... Projects like The Cancer Genome Atlas (TCGA) [Tomczak et al 2015] and the associated Clinical Proteomic Tumor Analysis Consortium (CPTAC) [Rudnick et al. 2016] are creating large repositories of omics data |
| Dataset Splits | No | The paper describes scenarios of data becoming available over time but does not specify formal training, validation, or test dataset splits (e.g., percentages, sample counts, or explicit cross-validation methods). |
| Hardware Specification | No | When executed linearly, the workflows in the analysis take 336 CPU hours on a single machine. When parallelized, the CPU time is approximately 35 hours. The paper mentions 'single machine' and 'CPU hours' but does not specify any particular CPU model, GPU, or other hardware components used for the experiments. |
| Software Dependencies | No | We have 3 different lines of inquiry with workflows that include popular omics analysis tools such as X!!Tandem [Bjornson et al 2008] and TopHat2 [Kim et al 2013], customProDB [Wang and Zhang 2013], SAMtools [Li 2009], PeptideProphet [Keller et al 2002], and ProteinProphet [Nesvizhskii et al 2003]. While specific tools are mentioned, their version numbers are not provided. |
| Experiment Setup | No | The paper describes the scenarios (data availability over time) and the general approach of triggering lines of inquiry and calculating confidence values. However, it does not specify concrete experimental setup details such as hyperparameters, specific training configurations, or model initialization settings for any underlying machine learning components. |
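The core behavior assessed above is DISK's continuous re-analysis: when a new dataset matches a line of inquiry, its workflow is re-run over all relevant data and the hypothesis confidence is revised. The sketch below illustrates that loop in miniature. All names here (`LineOfInquiry`, `on_new_data`, the toy confidence function) are hypothetical illustrations, not the actual DISK API.

```python
# Minimal sketch of a continuous-analysis loop in the style of DISK:
# new data that matches a line of inquiry triggers its workflow, and the
# resulting value replaces the hypothesis's current confidence.
# All class and function names are hypothetical, not DISK's real interface.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LineOfInquiry:
    """Pairs a data-matching query with an analysis workflow."""
    name: str
    matches: Callable[[Dict], bool]              # does a dataset trigger this line?
    run_workflow: Callable[[List[Dict]], float]  # returns a confidence value


@dataclass
class Hypothesis:
    statement: str
    confidence: float = 0.0
    history: List[float] = field(default_factory=list)


def on_new_data(hypothesis: Hypothesis,
                lines: List[LineOfInquiry],
                repository: List[Dict],
                new_dataset: Dict) -> Hypothesis:
    """Re-evaluate the hypothesis whenever a matching dataset arrives."""
    repository.append(new_dataset)
    for loi in lines:
        if loi.matches(new_dataset):
            relevant = [d for d in repository if loi.matches(d)]
            hypothesis.history.append(hypothesis.confidence)
            hypothesis.confidence = loi.run_workflow(relevant)
    return hypothesis


# Toy usage mirroring scenario 1 (new patient data arriving over time):
# confidence is revised each time another matching sample is analyzed.
loi = LineOfInquiry(
    name="proteogenomic association",
    matches=lambda d: d.get("type") in {"genomic", "proteomic"},
    run_workflow=lambda ds: min(1.0, 0.1 * len(ds)),  # stand-in for a real omics workflow
)
hyp = Hypothesis("Protein X is expressed in patients of cancer subtype Y")
repo: List[Dict] = []
for i in range(3):
    on_new_data(hyp, [loi], repo, {"type": "genomic", "patient": i})
print(round(hyp.confidence, 2))  # → 0.3
```

The stand-in workflow simply scales with the number of matching datasets; in DISK itself the confidence comes from executing real multi-omics workflows, which is why the paper's 336 CPU-hour figure (about 35 hours when parallelized) matters for this step.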