Non-identifiability and the Blessings of Misspecification in Models of Molecular Fitness
Authors: Eli Weinstein, Alan Amin, Jonathan Frazer, Debora Marks
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show on real datasets that perfect density estimation in the limit of infinite data would, with high confidence, result in poor fitness estimation; current models perform accurate fitness estimation because of, not despite, misspecification. Fifth, we apply our test to over 100 separate sequence datasets and fitness estimation tasks, to conclude that existing fitness estimation models systematically outperform the true data distribution p0 at estimating fitness (Sec. 7). |
| Researcher Affiliation | Academia | Eli N. Weinstein Columbia University ew2760@columbia.edu Alan N. Amin Harvard Medical School alanamin@g.harvard.edu Jonathan Frazer Harvard Medical School and Centre for Genomic Regulation (CRG) jonathan.frazer@crg.eu Debora S. Marks Harvard Medical School and Broad Institute of Harvard and MIT debbie@hms.harvard.edu |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | No explicit statement about releasing source code or a link to a code repository for the methodology described in this paper was found. |
| Open Datasets | Yes | For the first task, we considered 37 different assays across 32 different protein families, and for the second task, 97 genes across 87 protein families; for each protein family, we assembled datasets of evolutionarily related sequences, following previous work. Models We considered three existing fitness estimation models: a site-wise independent model (SWI), a Bayesian variational autoencoder (EVE [19], which is similar to Deep Sequence [44]), and a deep autoregressive model (Wavenet) [50]. |
| Dataset Splits | No | No specific percentages, absolute sample counts, or detailed splitting methodology for the training, validation, and test sets is provided in the main text. The paper mentions "held-out data" but does not quantify the splits. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running the experiments were found in the paper. |
| Software Dependencies | No | No specific software dependencies, libraries, or solvers with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9') were found in the paper. |
| Experiment Setup | No | The paper discusses the use of maximum likelihood estimation or approximate Bayesian inference and mentions a hyperparameter `h` in the context of the BEAR model, but the main text provides no specific numerical hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations. |