A generative nonparametric Bayesian model for whole genomes
Authors: Alan Amin, Eli N Weinstein, Debora Marks
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore, theoretically and empirically, applications of BEAR models to a variety of statistical problems including density estimation, robust parameter estimation, goodness-of-fit tests, and two-sample tests. On genomic, transcriptomic, and metagenomic sequence data we show that BEAR models provide large increases in predictive performance as compared to parametric autoregressive models, among other results. |
| Researcher Affiliation | Academia | Alan N. Amin 1,2, Eli N. Weinstein 2,3 and Debora S. Marks 2,4; 1 Program in Systems, Synthetic and Quantitative Biology; 2 Department of Systems Biology, Harvard Medical School; 3 Program in Biophysics, Harvard University; 4 Broad Institute of Harvard and MIT |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/debbiemarkslab/BEAR. |
| Open Datasets | Yes | We considered eleven datasets of four different types: whole genome sequencing read data, single cell RNA sequencing read data (including from patient tumors), metagenomic sequencing read data (including from patient fecal samples) and full bacterial genomes from across the tree of life (Section K). [1] 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. [64] P. W. Schreiber et al. Metagenomic virome sequencing... revealed JC polyomavirus transmission. |
| Dataset Splits | No | The paper states '25% of data was randomly held out for testing', implying the remaining 75% for training, but it does not specify a separate validation split percentage or method. |
| Hardware Specification | No | The paper discusses processing 'terabyte-scale datasets' and using a 'high-performance kmer counter', but it does not provide specific details such as GPU/CPU models, memory, or cloud instance types used for the experiments. |
| Software Dependencies | Yes | Using a high-performance kmer counter optimized for nucleotide data, KMC, we can compute the count matrix #(k, b) for all observed kmers k in terabyte-scale datasets, even when the matrix does not fit in main memory (Section J.2) [39]. [39] M. Kokot, M. Dlugosz, and S. Deorowicz. KMC 3: counting and manipulating k-mer statistics. |
| Experiment Setup | No | The paper mentions 'empirical Bayes methods that optimize point estimates of L, h and θ' and 'standard stochastic gradient-based optimization', but it does not provide specific hyperparameter values like learning rate, batch size, or number of epochs. |
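
The Software Dependencies row refers to a kmer count matrix computed with KMC. As a rough illustration of what such a matrix might hold, the sketch below counts in plain Python how often each length-`lag` kmer is followed by each letter. It is an assumption about the statistic, not the paper's pipeline: the function name `count_transitions`, the toy reads, and the choice to ignore start/stop symbols and reverse complements are all hypothetical, and an in-memory dictionary like this would not scale to the terabyte-scale datasets for which the paper uses KMC.

```python
from collections import Counter, defaultdict


def count_transitions(sequences, lag):
    """Count how often each length-`lag` kmer is followed by each letter.

    Hypothetical helper for illustration only; not part of the BEAR codebase.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Slide a window of `lag` letters and record the letter that follows it.
        for i in range(len(seq) - lag):
            kmer = seq[i:i + lag]
            next_letter = seq[i + lag]
            counts[kmer][next_letter] += 1
    return counts


# Toy reads (made up for this example).
reads = ["ACGTACGA", "CGTAC"]
counts = count_transitions(reads, lag=3)

for kmer in sorted(counts):
    print(kmer, dict(counts[kmer]))
```

Each row of the resulting table corresponds to one observed kmer and each column to a possible next letter, so only kmers that actually occur in the data take up space.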