Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Multi-resolution modeling of a discrete stochastic process identifies causes of cancer
Authors: Adam Uri Yaari, Maxwell Sherman, Oliver Clarke Priebe, Po-Ru Loh, Boris Katz, Andrei Barbu, Bonnie Berger
ICLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we present the split Poisson-Gamma (SPG) distribution, an extension of the classical Poisson-Gamma formulation, to model a discrete stochastic process at multiple resolutions. We demonstrate that the probability model has a closed-form posterior, enabling efficient and accurate linear-time prediction over any length scale after the parameters of the model have been inferred a single time. We apply our framework to model mutation rates in tumors and show that model parameters can be accurately inferred from high-dimensional epigenetic data using a convolutional neural network, Gaussian process, and maximum-likelihood estimation. Our method is both more accurate and more efficient than existing models over a large range of length scales. We demonstrate the usefulness of multi-resolution modeling by detecting genomic elements that drive tumor emergence and are of vastly differing sizes. |
| Researcher Affiliation | Academia | 1 MIT CSAIL, 2 MIT CBMM, 3 Broad Institute of MIT and Harvard, 4 MIT Department of Mathematics, 5 Division of Genetics, Brigham and Women's Hospital, 6 Department of Physics, University of Pennsylvania EMAIL, EMAIL |
| Pseudocode | No | The paper describes the model and methods textually and mathematically but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We obtained publicly available mutation counts from four cancer cohorts previously characterized by the Pan-Cancer Analysis of Whole Genomes Consortium (PCAWG) (Campbell et al., 2020) [...]. We obtained 733 datasets characterizing the patterns of these chemical modifications in 111 human tissues from Roadmap Epigenomics (Roadmap Epigenomics Consortium et al., 2015). |
| Dataset Splits | Yes | Before training, high-quality data regions were strictly split into train (64%), validation (16%) and test (20%) sets. |
| Hardware Specification | Yes | A benchmark run at 10kb scale with 10 GP reruns takes 2-3 hours on a single 24 Gb Nvidia RTX GPU, with 8 CPU cores and 756GB RAM. |
| Software Dependencies | No | The paper mentions software packages like "Pytorch Paszke et al. (2017)", "Python's GPyTorch package Gardner et al. (2019)", "Python Scikit-learn package", and "Python statsmodels package Seabold & Perktold (2010)". Although it cites the corresponding papers, it does not give explicit version numbers for the libraries themselves, which are required for reproducibility. |
| Experiment Setup | Yes | All networks were independently trained for 20 epochs with a batch size of 128 samples and using the Adam optimizer to minimize mean squared error loss. [...] The GP was optimized with 2000 inducing points using the Adam optimizer for 100 steps. [...] Before training, high-quality data regions were strictly split into train (64%), validation (16%) and test (20%) sets. |
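
The reported split proportions (64% train, 16% validation, 20% test) can be illustrated with a minimal sketch. The quoted text gives only the proportions; the shuffling, the seed, the region identifiers, and the function name `split_regions` below are assumptions made for illustration, not the authors' procedure.

```python
import random


def split_regions(region_ids, train=0.64, val=0.16, seed=0):
    """Shuffle region identifiers and cut them into train/validation/test lists.

    Hypothetical helper: the fractions come from the quoted paper text,
    everything else (shuffling, seed, rounding) is an assumption.
    """
    ids = list(region_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is repeatable
    n_train = round(train * len(ids))
    n_val = round(val * len(ids))
    # Whatever remains after train and validation becomes the test set (about 20%).
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Toy example with 1000 placeholder region indices.
train_ids, val_ids, test_ids = split_regions(range(1000))
print(len(train_ids), len(val_ids), len(test_ids))
```

The three lists are disjoint and together cover every input region. Note that 64/16/20 is also what results from holding out 20% for testing and then holding out 20% of the remainder for validation, though the quoted text does not say whether the split was made that way.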