Non-identifiability and the Blessings of Misspecification in Models of Molecular Fitness
Authors: Eli Weinstein, Alan Amin, Jonathan Frazer, Debora Marks
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show on real datasets that perfect density estimation in the limit of infinite data would, with high confidence, result in poor fitness estimation; current models perform accurate fitness estimation because of, not despite, misspecification. Fifth, we apply our test to over 100 separate sequence datasets and fitness estimation tasks, to conclude that existing fitness estimation models systematically outperform the true data distribution p0 at estimating fitness (Sec. 7). |
| Researcher Affiliation | Academia | Eli N. Weinstein Columbia University ew2760@columbia.edu Alan N. Amin Harvard Medical School alanamin@g.harvard.edu Jonathan Frazer Harvard Medical School and Centre for Genomic Regulation (CRG) jonathan.frazer@crg.eu Debora S. Marks Harvard Medical School and Broad Institute of Harvard and MIT debbie@hms.harvard.edu |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | No explicit statement about releasing source code or a link to a code repository for the methodology described in this paper was found. |
| Open Datasets | Yes | For the first task, we considered 37 different assays across 32 different protein families, and for the second task, 97 genes across 87 protein families; for each protein family, we assembled datasets of evolutionarily related sequences, following previous work. Models We considered three existing fitness estimation models: a site-wise independent model (SWI), a Bayesian variational autoencoder (EVE [19], which is similar to Deep Sequence [44]), and a deep autoregressive model (Wavenet) [50]. |
| Dataset Splits | No | No specific percentages, absolute sample counts, or detailed splitting methodology for the training, validation, and test sets is provided in the main text. The paper mentions "held-out data" but does not quantify the splits. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running the experiments were found in the paper. |
| Software Dependencies | No | No specific software dependencies, libraries, or solvers with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9') were found in the paper. |
| Experiment Setup | No | The paper discusses the use of maximum likelihood estimation or approximate Bayesian inference and mentions a hyperparameter `h` in the context of the BEAR model, but the main text provides no specific numerical hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations. |