Efficiently predicting high resolution mass spectra with graph neural networks

Authors: Michael Murphy, Stefanie Jegelka, Ernest Fraenkel, Tobias Kind, David Healey, Thomas Butler

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train our model on the NIST-20 tandem MS library (National Institute of Standards and Technology, 2020). This is the largest commercial dataset of high resolution mass spectra of small molecules, curated by expert chemists, and is available for a modest fee. We use an 80/10/10 structure-disjoint train/validation/test split, which we generate by grouping spectra according to the connectivity substring of their InChIKey (Heller et al., 2015), and assigning groups of spectra to splits. As the baseline CFM-ID only predicts monoisotopic spectra at qualitative energy levels {low, medium, high}, we restrict the test set to spectra with corresponding energies {20, 35, 50} in which no peaks were annotated as higher isotopes. This yields 287,995 (18,665) training, 36,265 (2,346) validation, and 4,424 (1,632) test spectra (structures).
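As a concrete illustration of this test-set restriction, a filter along the following lines would do the job; the file name and the columns collision_energy and has_higher_isotope_peaks are assumptions for illustration, not part of the released code.

```python
import pandas as pd

# Hypothetical per-spectrum metadata table; the file name and column names
# (collision_energy, has_higher_isotope_peaks) are assumed, not from the paper's code.
spectra = pd.read_parquet("nist20_test_metadata.parquet")

# CFM-ID predicts monoisotopic spectra at {low, medium, high}, which the paper
# matches to collision energies {20, 35, 50}.
cfm_energies = {20, 35, 50}
comparable = spectra[
    spectra["collision_energy"].isin(cfm_energies)
    & ~spectra["has_higher_isotope_peaks"]
]
print(f"Retained {len(comparable)} spectra for the CFM-ID comparison")
```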
Researcher Affiliation | Collaboration | Michael Murphy (1, 2), Stefanie Jegelka (1), Ernest Fraenkel (2), Tobias Kind (3), David Healey (3), Thomas Butler (3). The lead author carried out this work as an intern at Enveda Biosciences. (1) Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA; (2) Department of Biological Engineering, MIT, Cambridge, MA, USA; (3) Enveda Biosciences, Boulder, CO, USA. Correspondence to: Michael Murphy <murphy17@mit.edu>, Thomas Butler <tom.butler@envedabio.com>.
Pseudocode | Yes | A. Fixed vocabulary selection: Algorithm 1 describes our procedure for selecting the product ions P̂ and neutral losses L̂.
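Algorithm 1 itself is not reproduced on this page; the sketch below shows one plausible frequency-based selection over annotated peaks. The (kind, formula) pair encoding and the ranking rule are assumptions, not the paper's exact procedure.

```python
from collections import Counter

def select_vocabulary(annotated_peaks, k=10_000):
    """Frequency-based sketch of fixed-vocabulary selection.

    annotated_peaks: iterable of (kind, formula) pairs, where kind is
    "product" or "loss" and formula is a molecular-formula string.
    The paper's Algorithm 1 may rank and partition candidates differently.
    """
    counts = Counter(annotated_peaks)
    top = counts.most_common(k)
    product_ions = [f for (kind, f), _ in top if kind == "product"]
    neutral_losses = [f for (kind, f), _ in top if kind == "loss"]
    return product_ions, neutral_losses
```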
Open Source Code | Yes | Software and Data: We provide code, data, and trained models at https://github.com/murphy17/graff-ms. The NIST-20 license agreement prohibits including spectra from it; we therefore provide instructions on how to obtain it.
Open Datasets | Yes | We train our model on the NIST-20 tandem MS library (National Institute of Standards and Technology, 2020).
Dataset Splits | Yes | We use an 80/10/10 structure-disjoint train/validation/test split, which we generate by grouping spectra according to the connectivity substring of their InChIKey (Heller et al., 2015), and assigning groups of spectra to splits. This yields 287,995 (18,665) training, 36,265 (2,346) validation, and 4,424 (1,632) test spectra (structures).
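A minimal sketch of such a structure-disjoint split, assuming each spectrum record carries an "inchikey" field and that whole connectivity groups are shuffled and assigned to splits at random:

```python
import random
from collections import defaultdict

def structure_disjoint_split(spectra, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole InChIKey-connectivity groups to train/val/test.

    spectra: list of dicts, each assumed to carry an "inchikey" field.
    The connectivity substring is the first (14-character) block of the key.
    """
    groups = defaultdict(list)
    for s in spectra:
        groups[s["inchikey"].split("-")[0]].append(s)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_train = int(fractions[0] * len(keys))
    n_val = int(fractions[1] * len(keys))
    split_keys = {
        "train": keys[:n_train],
        "val": keys[n_train:n_train + n_val],
        "test": keys[n_train + n_val:],
    }
    return {name: [s for k in ks for s in groups[k]]
            for name, ks in split_keys.items()}
```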
Hardware Specification | Yes | All models are trained using PyTorch Lightning with automatic mixed precision on 2 Tesla V100 GPUs.
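A minimal PyTorch Lightning Trainer configuration consistent with this description; flag names follow recent Lightning releases and are not taken from the authors' code.

```python
import pytorch_lightning as pl

# Automatic mixed precision on 2 GPUs, 100 training epochs as stated elsewhere
# in the setup. Exact flag spellings vary slightly across Lightning versions.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    precision=16,   # automatic mixed precision
    max_epochs=100,
)
```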
Software Dependencies | No | The paper mentions PyTorch Lightning but does not specify its version number. It also mentions DGL-LifeSci without a version number.
Experiment Setup | Yes | We use a vocabulary of K = 10,000 formulas. We train an L = 6-layer encoder and L = 2-layer decoder with d_enc = 512 and d_dec = 1024, resulting in 24.1 million trainable parameters. We use the d_eig = 8 lowest-frequency eigenvalues, truncating or padding with zeros. Dropout is applied at rate 0.1. We use a batch size of 512 and the Adam optimizer (Kingma & Ba, 2015) with learning rate 5e-4 and weight decay 1e-5. We train for 100 epochs and use the model from the epoch with the lowest validation loss.
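These optimization choices translate roughly into the following Lightning hooks; the module name GraffMSModule and the monitored metric name "val_loss" are assumptions for illustration, not the released implementation.

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

class GraffMSModule(pl.LightningModule):
    # Optimizer settings quoted above: Adam, learning rate 5e-4, weight decay 1e-5.
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=5e-4, weight_decay=1e-5)

# Keep only the checkpoint from the epoch with the lowest validation loss;
# the metric name "val_loss" is an assumed convention.
checkpoint_cb = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1)
```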