Learning Mutational Semantics
Authors: Brian Hie, Ellen Zhong, Bryan Bryson, Bonnie Berger
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report good empirical performance on CSCS [constrained semantic change search] of single-word mutations to news headlines, map a continuous semantic space of viral variation, and, notably, show unprecedented zero-shot prediction of single-residue escape mutations to key influenza and HIV proteins, suggesting a productive link between modeling natural language and pathogenic evolution. |
| Researcher Affiliation | Academia | Brian Hie (MIT, brianhie@mit.edu); Ellen D. Zhong (MIT, zhonge@mit.edu); Bryan D. Bryson (MIT, bryand@mit.edu); Bonnie Berger (MIT, bab@mit.edu) |
| Pseudocode | No | The paper does not contain a pseudocode or algorithm block. |
| Open Source Code | Yes | Code at https://github.com/brianhie/mutational-semantics-neurips2020. |
| Open Datasets | Yes | Our training corpus consisted of 1,186,018 headlines from the Australian Broadcasting Corporation from 2003 through 2019 (Appendix 6.1.1) [34]. Our training data consists of 44,999 unique influenza A hemagglutinin (HA) amino acid sequences (around 550 residues in length) observed in animal hosts from 1908 through 2019. ...Data was obtained from the NIAID Influenza Research Database (IRD) [55] through the web site at http://www.fludb.org (Appendix 6.1.2). We train our language model on 60,857 unique Env sequences from the Los Alamos National Laboratory (LANL) HIV database (Appendix 6.1.3) [23]. |
| Dataset Splits | Yes | We selected our model architecture by holding out a test set of headlines from 2016 onward (179,887 headlines, about 15%) and evaluating cross entropy loss for the language modeling task. We used a cross-validation strategy within the training set to grid search hyperparameters (Appendix 6.3.1). (A sketch of this temporal split appears below the table.) |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments. |
| Software Dependencies | No | The paper mentions using the NLTK POS tagger [10] and the FLAIR POS tagger [3], but does not provide specific version numbers for these software components. (An illustrative tagging example appears below the table.) |
| Experiment Setup | Yes | In our experiments, we used a 20-dimensional dense embedding for each element in the alphabet X, two BiLSTM layers with 512 units, and categorical cross entropy loss optimized by Adam with a learning rate of 0.001, β1 = 0.9, and β2 = 0.999. (A model sketch follows the table.) |
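
The dataset-splits row describes a temporal hold-out: headlines from 2016 onward form the test set (179,887 headlines, about 15%). The sketch below shows how such a split might be reproduced; the file path and the `publish_date` (YYYYMMDD integers) and `headline_text` column names are assumptions based on the public ABC headlines CSV, not details taken from the paper.

```python
# Minimal sketch of the temporal hold-out described above (assumed column
# names and file path): headlines from 2016 onward are held out for testing.
import pandas as pd

df = pd.read_csv("abcnews-date-text.csv")   # hypothetical path to the corpus
year = df["publish_date"] // 10000          # publish_date stored as YYYYMMDD

train = df.loc[year < 2016, "headline_text"].tolist()
test = df.loc[year >= 2016, "headline_text"].tolist()

print(f"train: {len(train):,} headlines, "
      f"test: {len(test):,} headlines ({len(test) / len(df):.1%} held out)")
```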
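
The software-dependencies row notes that the paper uses the NLTK and FLAIR POS taggers without pinning versions. As an illustration only (not the authors' pipeline), this is one way a headline could be POS-tagged with NLTK; the example headline is arbitrary.

```python
# Illustrative POS tagging of a headline with NLTK (not the authors' exact
# pipeline). Resource names vary across NLTK releases; these cover recent ones.
import nltk

for pkg in ("punkt", "punkt_tab",
            "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(pkg, quiet=True)

headline = "aba decides against community broadcasting licence"
tokens = nltk.word_tokenize(headline)
print(nltk.pos_tag(tokens))  # list of (token, Penn Treebank tag) pairs
```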
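
Finally, the experiment-setup row gives enough hyperparameters to sketch the language model in Keras: a 20-dimensional embedding, two BiLSTM layers with 512 units each, categorical cross entropy, and Adam with learning rate 0.001, β1 = 0.9, β2 = 0.999. The block below is an assumption-laden reconstruction of a plain stacked BiLSTM with those settings, not the authors' exact architecture (see the linked repository for that); `vocab_size` is a placeholder for the alphabet size.

```python
# A minimal sketch of a stacked BiLSTM language model using the quoted
# hyperparameters; not the authors' exact architecture.
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 25  # placeholder: alphabet size plus any special tokens

inputs = keras.Input(shape=(None,), dtype="int32")              # token indices
x = layers.Embedding(input_dim=vocab_size, output_dim=20)(inputs)
x = layers.Bidirectional(layers.LSTM(512, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(512, return_sequences=True))(x)
outputs = layers.Dense(vocab_size, activation="softmax")(x)     # per-position distribution

model = keras.Model(inputs, outputs)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
    loss="categorical_crossentropy",  # expects one-hot targets per position
)
model.summary()
```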