Learning Mutational Semantics

Authors: Brian Hie, Ellen Zhong, Bryan Bryson, Bonnie Berger

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report good empirical performance on CSCS of single-word mutations to news headlines, map a continuous semantic space of viral variation, and, notably, show unprecedented zero-shot prediction of single-residue escape mutations to key influenza and HIV proteins, suggesting a productive link between modeling natural language and pathogenic evolution.
Researcher Affiliation | Academia | Brian Hie (MIT, brianhie@mit.edu); Ellen D. Zhong (MIT, zhonge@mit.edu); Bryan D. Bryson (MIT, bryand@mit.edu); Bonnie Berger (MIT, bab@mit.edu)
Pseudocode | No | The paper does not contain pseudocode or an algorithm block.
Open Source Code | Yes | Code is available at https://github.com/brianhie/mutational-semantics-neurips2020.
Open Datasets | Yes | Our training corpus consisted of 1,186,018 headlines from the Australian Broadcasting Corporation from 2003 through 2019 (Appendix 6.1.1) [34]. Our training data consists of 44,999 unique influenza A hemagglutinin (HA) amino acid sequences (around 550 residues in length) observed in animal hosts from 1908 through 2019. ... Data was obtained from the NIAID Influenza Research Database (IRD) [55] through the web site at http://www.fludb.org (Appendix 6.1.2). We train our language model on 60,857 unique Env sequences from the Los Alamos National Laboratory (LANL) HIV database (Appendix 6.1.3) [23].
Dataset Splits | Yes | We selected our model architecture by holding out a test set of headlines from 2016 onward (179,887 headlines, about 15%) and evaluating cross entropy loss for the language modeling task. We used a cross-validation strategy within the training set to grid search hyperparameters (Appendix 6.3.1). (A sketch of this date-based holdout appears after the table.)
Hardware Specification | No | The paper does not specify the hardware used for running the experiments.
Software Dependencies | No | The paper mentions using the NLTK POS tagger [10] and the FLAIR POS tagger [3] but does not provide version numbers for these software components.
Experiment Setup | Yes | In our experiments, we used a 20-dimensional dense embedding for each element in the alphabet X, two BiLSTM layers with 512 units, and categorical cross entropy loss optimized by Adam with a learning rate of 0.001, β1 = 0.9, and β2 = 0.999. (A hedged model sketch based on these hyperparameters follows the table.)
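
The date-based holdout quoted under Dataset Splits can be illustrated in a few lines of Python. The sketch below is an assumption-laden illustration, not the authors' code: the file name abcnews-date-text.csv, the publish_date column in YYYYMMDD format, and the headline_text column follow the publicly distributed ABC headlines dump and are not confirmed by the paper; only the 2016 cutoff and the roughly 15% test fraction come from the quoted text.

```python
# Hypothetical illustration of the 2016-onward headline holdout.
# File and column names are assumptions, not taken from the paper.
import csv

train, test = [], []
with open("abcnews-date-text.csv", newline="") as f:
    for row in csv.DictReader(f):
        year = int(str(row["publish_date"])[:4])  # publish_date assumed to be YYYYMMDD
        (test if year >= 2016 else train).append(row["headline_text"])

# The paper reports 179,887 held-out headlines, about 15% of the corpus.
print(f"train: {len(train):,} headlines, test: {len(test):,} headlines")
```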
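
The Experiment Setup row fixes the embedding size, layer widths, loss, and optimizer settings, so a minimal Keras sketch with those values can serve as a reference point. Everything else here is an assumption: the vocabulary size, the use of tf.keras.Sequential, and the per-position softmax head are placeholders, and the sketch does not reproduce how the authors' bidirectional language model masks the position being predicted (see their repository for the actual implementation).

```python
# Minimal sketch using the hyperparameters quoted above; architecture details
# beyond those hyperparameters are assumptions, not the authors' exact model.
import tensorflow as tf

vocab_size = 28  # placeholder, e.g. amino-acid alphabet plus special tokens

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),                     # variable-length token sequence
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=20),   # 20-dim dense embedding
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(512, return_sequences=True)),  # BiLSTM layer 1
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(512, return_sequences=True)),  # BiLSTM layer 2
    tf.keras.layers.Dense(vocab_size, activation="softmax"),          # per-position distribution over the alphabet
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
    loss="sparse_categorical_crossentropy",  # categorical cross entropy over integer-encoded tokens
)
model.summary()
```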