Learning Mutational Semantics
Authors: Brian Hie, Ellen Zhong, Bryan Bryson, Bonnie Berger
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report good empirical performance on CSCS [constrained semantic change search] of single-word mutations to news headlines, map a continuous semantic space of viral variation, and, notably, show unprecedented zero-shot prediction of single-residue escape mutations to key influenza and HIV proteins, suggesting a productive link between modeling natural language and pathogenic evolution. |
| Researcher Affiliation | Academia | Brian Hie (MIT, brianhie@mit.edu); Ellen D. Zhong (MIT, zhonge@mit.edu); Bryan D. Bryson (MIT, bryand@mit.edu); Bonnie Berger (MIT, bab@mit.edu) |
| Pseudocode | No | The paper does not contain a pseudocode or algorithm block. |
| Open Source Code | Yes | Code at https://github.com/brianhie/mutational-semantics-neurips2020. |
| Open Datasets | Yes | Our training corpus consisted of 1,186,018 headlines from the Australian Broadcasting Corporation from 2003 through 2019 (Appendix 6.1.1) [34]. Our training data consists of 44,999 unique influenza A hemagglutinin (HA) amino acid sequences (around 550 residues in length) observed in animal hosts from 1908 through 2019. ...Data was obtained from the NIAID Influenza Research Database (IRD) [55] through the web site at http://www.fludb.org (Appendix 6.1.2). We train our language model on 60,857 unique Env sequences from the Los Alamos National Laboratory (LANL) HIV database (Appendix 6.1.3) [23]. |
| Dataset Splits | Yes | We selected our model architecture by holding out a test set of headlines from 2016 onward (179,887 headlines, about 15%) and evaluating cross entropy loss for the language modeling task. We used a cross-validation strategy within the training set to grid search hyperparameters (Appendix 6.3.1). (A sketch of this temporal split appears below the table.) |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments. |
| Software Dependencies | No | The paper mentions using the NLTK POS tagger [10] and the FLAIR POS tagger [3], but does not provide specific version numbers for these software components. (An illustrative tagging example appears below the table.) |
| Experiment Setup | Yes | In our experiments, we used a 20-dimensional dense embedding for each element in the alphabet X, two BiLSTM layers with 512 units, and categorical cross entropy loss optimized by Adam with a learning rate of 0.001, β1 = 0.9, and β2 = 0.999. (A model sketch follows the table.) |
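
The dataset-splits row describes a temporal hold-out: headlines from 2016 onward form the test set (179,887 headlines, about 15%). The sketch below shows how such a split might be reproduced; the file path and the `publish_date` (YYYYMMDD integers) and `headline_text` column names are assumptions based on the public ABC headlines CSV, not details taken from the paper.

```python
# Minimal sketch of the temporal hold-out described above (assumed column
# names and file path): headlines from 2016 onward are held out for testing.
import pandas as pd

df = pd.read_csv("abcnews-date-text.csv")   # hypothetical path to the corpus
year = df["publish_date"] // 10000          # publish_date stored as YYYYMMDD

train = df.loc[year < 2016, "headline_text"].tolist()
test = df.loc[year >= 2016, "headline_text"].tolist()

print(f"train: {len(train):,} headlines, "
      f"test: {len(test):,} headlines ({len(test) / len(df):.1%} held out)")
```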
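
The software-dependencies row notes that the paper uses the NLTK and FLAIR POS taggers without pinning versions. As an illustration only (not the authors' pipeline), this is one way a headline could be POS-tagged with NLTK; the example headline is arbitrary.

```python
# Illustrative POS tagging of a headline with NLTK (not the authors' exact
# pipeline). Resource names vary across NLTK releases; these cover recent ones.
import nltk

for pkg in ("punkt", "punkt_tab",
            "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(pkg, quiet=True)

headline = "aba decides against community broadcasting licence"
tokens = nltk.word_tokenize(headline)
print(nltk.pos_tag(tokens))  # list of (token, Penn Treebank tag) pairs
```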
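
Finally, the experiment-setup row gives enough hyperparameters to sketch the language model in Keras: a 20-dimensional embedding, two BiLSTM layers with 512 units each, categorical cross entropy, and Adam with learning rate 0.001, β1 = 0.9, β2 = 0.999. The block below is an assumption-laden reconstruction of a plain stacked BiLSTM with those settings, not the authors' exact architecture (see the linked repository for that); `vocab_size` is a placeholder for the alphabet size.

```python
# A minimal sketch of a stacked BiLSTM language model using the quoted
# hyperparameters; not the authors' exact architecture.
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 25  # placeholder: alphabet size plus any special tokens

inputs = keras.Input(shape=(None,), dtype="int32")              # token indices
x = layers.Embedding(input_dim=vocab_size, output_dim=20)(inputs)
x = layers.Bidirectional(layers.LSTM(512, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(512, return_sequences=True))(x)
outputs = layers.Dense(vocab_size, activation="softmax")(x)     # per-position distribution

model = keras.Model(inputs, outputs)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
    loss="categorical_crossentropy",  # expects one-hot targets per position
)
model.summary()
```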