Multi-resolution modeling of a discrete stochastic process identifies causes of cancer

Authors: Adam Uri Yaari, Maxwell Sherman, Oliver Clarke Priebe, Po-Ru Loh, Boris Katz, Andrei Barbu, Bonnie Berger

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we present the split Poisson-Gamma (SPG) distribution, an extension of the classical Poisson-Gamma formulation, to model a discrete stochastic process at multiple resolutions. We demonstrate that the probability model has a closed-form posterior, enabling efficient and accurate linear-time prediction over any length scale after the parameters of the model have been inferred a single time. We apply our framework to model mutation rates in tumors and show that model parameters can be accurately inferred from high-dimensional epigenetic data using a convolutional neural network, Gaussian process, and maximum-likelihood estimation. Our method is both more accurate and more efficient than existing models over a large range of length scales. We demonstrate the usefulness of multi-resolution modeling by detecting genomic elements that drive tumor emergence and are of vastly differing sizes.
Researcher Affiliation | Academia | 1 MIT CSAIL; 2 MIT CBMM; 3 Broad Institute of MIT and Harvard; 4 MIT Department of Mathematics; 5 Division of Genetics, Brigham and Women's Hospital; 6 Department of Physics, University of Pennsylvania. {yaari,maxas,priebeo,boris,abarbu,bab}@csail.mit.edu, poruloh@broadinstitute.org
Pseudocode | No | The paper describes the model and methods textually and mathematically but does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a code repository.
Open Datasets | Yes | We obtained publicly available mutation counts from four cancer cohorts previously characterized by the Pan-Cancer Analysis of Whole Genomes Consortium (PCAWG) (Campbell et al., 2020) [...]. We obtained 733 datasets characterizing the patterns of these chemical modifications in 111 human tissues from Roadmap Epigenomics (Roadmap Epigenomics Consortium et al., 2015).
Dataset Splits | Yes | Before training, high-quality data regions were strictly split into train (64%), validation (16%) and test (20%) sets.
Hardware Specification | Yes | A benchmark run at 10kb scale with 10 GP reruns takes 2-3 hours on a single 24 Gb Nvidia RTX GPU, with 8 CPU cores and 756GB RAM.
Software Dependencies | No | The paper mentions software packages like "Pytorch Paszke et al. (2017)", "Python's GPyTorch package Gardner et al. (2019)", "Python Scikit-learn package", and "Python statsmodels package Seabold & Perktold (2010)". Although it cites the corresponding papers, it gives no version numbers for the libraries themselves, which reproducibility requires.
Experiment Setup | Yes | All networks were independently trained for 20 epochs with a batch size of 128 samples and using the Adam optimizer to minimize mean squared error loss. [...] The GP was optimized with 2000 inducing points using the Adam optimizer for 100 steps. [...] Before training, high-quality data regions were strictly split into train (64%), validation (16%) and test (20%) sets.
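
The 64% / 16% / 20% proportions quoted under Dataset Splits are what one gets by holding out 20% of the regions as a test set and then taking 20% of the remainder as validation (0.8 × 0.8 = 0.64, 0.8 × 0.2 = 0.16). The sketch below illustrates that arithmetic only. The two-stage procedure, the function name `split_regions`, its parameters, and the seed are all assumptions made for illustration; the paper excerpt does not say how the split was produced.

```python
import random


def split_regions(region_ids, test_frac=0.20, val_frac_of_rest=0.20, seed=0):
    """Shuffle region IDs and split them into train / validation / test.

    Hypothetical helper: the two-stage hold-out, the names, and the seed
    are illustrative assumptions, not taken from the paper.
    """
    ids = list(region_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle, so the split is repeatable

    # Stage 1: hold out the test set.
    n_test = round(len(ids) * test_frac)
    test, rest = ids[:n_test], ids[n_test:]

    # Stage 2: carve the validation set out of what remains.
    n_val = round(len(rest) * val_frac_of_rest)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = split_regions(range(1000))
print(len(train), len(val), len(test))  # prints: 640 160 200
```

Fixing the seed keeps the three sets disjoint and identical across runs, which matters when several models are compared on the same held-out regions.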
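
For readers unfamiliar with the classical Poisson-Gamma formulation that the abstract says SPG extends: when a Poisson rate is itself Gamma-distributed, the marginal count distribution is negative binomial. The sketch below covers only that textbook result, not the paper's split Poisson-Gamma model. The shape/rate parameterization, the function name `poisson_gamma_pmf`, and the numeric values are assumptions chosen for illustration.

```python
import math


def poisson_gamma_pmf(k, alpha, beta):
    """P(X = k) for X | lam ~ Poisson(lam), lam ~ Gamma(shape=alpha, rate=beta).

    Integrating out lam gives a negative binomial with r = alpha and
    success probability p = beta / (1 + beta). Computed in log space
    to avoid overflow in the gamma functions.
    """
    p = beta / (1.0 + beta)
    log_pmf = (
        math.lgamma(k + alpha) - math.lgamma(alpha) - math.lgamma(k + 1)
        + alpha * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


alpha, beta = 3.0, 0.5  # illustrative values only, not from the paper
ks = range(400)         # truncation point; the tail beyond it is negligible here
probs = [poisson_gamma_pmf(k, alpha, beta) for k in ks]

total = sum(probs)
mean = sum(k * q for k, q in zip(ks, probs))
var = sum((k - mean) ** 2 * q for k, q in zip(ks, probs))
print(total, mean, var)
```

With these values the mean should come out close to alpha / beta = 6 and the variance close to alpha / beta + alpha / beta**2 = 18. The variance exceeding the mean is the overdispersion that a plain Poisson model cannot represent.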