Multi-resolution modeling of a discrete stochastic process identifies causes of cancer
Authors: Adam Uri Yaari, Maxwell Sherman, Oliver Clarke Priebe, Po-Ru Loh, Boris Katz, Andrei Barbu, Bonnie Berger
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we present the split Poisson-Gamma (SPG) distribution, an extension of the classical Poisson-Gamma formulation, to model a discrete stochastic process at multiple resolutions. We demonstrate that the probability model has a closed-form posterior, enabling efficient and accurate linear-time prediction over any length scale after the parameters of the model have been inferred a single time. We apply our framework to model mutation rates in tumors and show that model parameters can be accurately inferred from high-dimensional epigenetic data using a convolutional neural network, Gaussian process, and maximum-likelihood estimation. Our method is both more accurate and more efficient than existing models over a large range of length scales. We demonstrate the usefulness of multi-resolution modeling by detecting genomic elements that drive tumor emergence and are of vastly differing sizes. |
| Researcher Affiliation | Academia | 1 MIT CSAIL, 2 MIT CBMM, 3 Broad Institute of MIT and Harvard, 4 MIT Department of Mathematics, 5 Division of Genetics, Brigham and Women's Hospital, 6 Department of Physics, University of Pennsylvania {yaari,maxas,priebeo,boris,abarbu,bab}@csail.mit.edu, poruloh@broadinstitute.org |
| Pseudocode | No | The paper describes the model and methods textually and mathematically but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We obtained publicly available mutation counts from four cancer cohorts previously characterized by the Pan-Cancer Analysis of Whole Genomes Consortium (PCAWG) (Campbell et al., 2020) [...]. We obtained 733 datasets characterizing the patterns of these chemical modifications in 111 human tissues from Roadmap Epigenomics (Roadmap Epigenomics Consortium et al., 2015). |
| Dataset Splits | Yes | Before training, high-quality data regions were strictly split into train (64%), validation (16%) and test (20%) sets. |
| Hardware Specification | Yes | A benchmark run at 10kb scale with 10 GP reruns takes 2-3 hours on a single 24 Gb Nvidia RTX GPU, with 8 CPU cores and 756GB RAM. |
| Software Dependencies | No | The paper mentions software packages such as "PyTorch (Paszke et al., 2017)", "Python's GPyTorch package (Gardner et al., 2019)", "Python Scikit-learn package", and "Python statsmodels package (Seabold & Perktold, 2010)". Although it cites the corresponding papers, it does not give version numbers for the libraries themselves, which are needed for reproducibility. |
| Experiment Setup | Yes | All networks were independently trained for 20 epochs with a batch size of 128 samples and using the Adam optimizer to minimize mean squared error loss. [...] The GP was optimized with 2000 inducing points using the Adam optimizer for 100 steps. [...] Before training, high-quality data regions were strictly split into train (64%), validation (16%) and test (20%) sets. |
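
The 64% / 16% / 20% split quoted in the table is what results from holding out 20% of the regions for testing and then holding out 20% of the remainder for validation (0.8 × 0.8 = 0.64, 0.8 × 0.2 = 0.16). A minimal sketch of such a split follows. The paper excerpt does not say how regions were assigned to each set, so the random shuffle, the `seed` parameter, and the function name `split_regions` are all assumptions made for illustration only.

```python
import random


def split_regions(region_ids, seed=0):
    """Split region ids into train/validation/test at 64% / 16% / 20%.

    Hypothetical helper: the random shuffle and the seed are assumptions,
    not the procedure described in the paper.
    """
    ids = list(region_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle for repeatability

    n = len(ids)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = n * 64 // 100
    n_val = n * 16 // 100

    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # the remainder, about 20%
    return train, val, test


train, val, test = split_regions(range(1000))
print(len(train), len(val), len(test))
```

Because the three sets are slices of a single shuffled list, they are disjoint by construction, which matches the "strictly split" wording in the table.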