A generative nonparametric Bayesian model for whole genomes

Authors: Alan Amin, Eli N Weinstein, Debora Marks

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We explore, theoretically and empirically, applications of BEAR models to a variety of statistical problems including density estimation, robust parameter estimation, goodness-of-fit tests, and two-sample tests. On genomic, transcriptomic, and metagenomic sequence data we show that BEAR models provide large increases in predictive performance as compared to parametric autoregressive models, among other results.
Researcher Affiliation Academia Alan N. Aminú 1,2, Eli N. Weinsteinú 2,3 and Debora S. Marks2,4 1 Program in Systems, Synthetic and Quantitative Biology 2 Department of Systems Biology Harvard Medical School 3 Program in Biophysics, Harvard University 4 Broad Institute of Harvard and MIT
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code Yes Code is available at https://github. com/debbiemarkslab/BEAR.
Open Datasets Yes We considered eleven datasets of four different types: whole genome sequencing read data, single cell RNA sequencing read data (including from patient tumors), metagenomic sequencing read data (including from patient fecal samples) and full bacterial genomes from across the tree of life (Section K). [1] 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. [64] P. W. Schreiber et al. Metagenomic virome sequencing... revealed JC polyomavirus transmission.
Dataset Splits No The paper states '25% of data was randomly held out for testing', implying the remaining 75% for training, but it does not specify a separate validation split percentage or method.
Hardware Specification No The paper discusses processing 'terabyte-scale datasets' and using a 'high-performance kmer counter', but it does not provide specific details such as GPU/CPU models, memory, or cloud instance types used for the experiments.
Software Dependencies Yes Using a high-performance kmer counter optimized for nucleotide data, KMC, we can compute the count matrix #( , ) for all observed kmers k in terabyte-scale datasets, even when the matrix does not fit in main memory (Section J.2) [39]. [39] M. Kokot, M. Dlugosz, and S. Deorowicz. KMC 3: counting and manipulating k-mer statistics.
Experiment Setup No The paper mentions 'empirical Bayes methods that optimize point estimates of L, h and ' and 'standard stochastic gradient-based optimization', but it does not provide specific hyperparameter values like learning rate, batch size, or number of epochs.