ContraNovo: A Contrastive Learning Approach to Enhance De Novo Peptide Sequencing
Authors: Zhi Jin, Sheng Xu, Xiang Zhang, Tianze Ling, Nanqing Dong, Wanli Ouyang, Zhiqiang Gao, Cheng Chang, Siqi Sun
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through rigorous evaluations on two benchmark datasets, ContraNovo consistently outshines contemporary state-of-the-art solutions, underscoring its promising potential in enhancing de novo peptide sequencing. |
| Researcher Affiliation | Collaboration | Zhi Jin (1,2)*, Sheng Xu (3,1), Xiang Zhang (1,4), Tianze Ling (5), Nanqing Dong (1), Wanli Ouyang (1), Zhiqiang Gao (1), Cheng Chang (5), Siqi Sun (3,1). 1: Shanghai Artificial Intelligence Laboratory; 2: Department of Computer Science, Soochow University; 3: Research Institute of Intelligent Complex Systems, Fudan University; 4: University of British Columbia; 5: National Center for Protein Sciences (Beijing) |
| Pseudocode | No | Not found. The paper describes the architecture and processes in text and mathematical formulas but does not include a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | The source code is available at https://github.com/BEAM-Labs/ContraNovo. |
| Open Datasets | Yes | Inspired by this, we employed the expansive MassIVE-KB dataset (Wang et al. 2018) to bolster our contrastive learning representation. This dataset, earlier leveraged by studies like GLEAMS (Bittremieux et al. 2022) and Casanovo, boasts 30 million high-resolution peptide-spectrum matches (PSMs), derived from a multitude of instruments, and encompassing numerous post-translational modifications. For model validation and benchmarking against leading de novo peptide sequencing techniques, we employed the nine-species benchmark dataset presented by DeepNovo. This dataset collects approximately 1.5 million mass spectra from nine distinct experiments, all utilizing the same instrument but analyzing peptides from different species. Every spectrum is accompanied by a peptide sequence, confirmed through a database search identification at a standard false discovery rate (FDR) of 1%. In Casanovo's latest iteration, the dataset underwent revision using the protein identification software Crux (McIlwain et al. 2014), in alignment with a Percolator (Spivak et al. 2009) q-value < 0.01, based on the same nine PRIDE datasets (Martens et al. 2005). |
| Dataset Splits | Yes | For model validation and benchmarking against leading de novo peptide sequencing techniques, we employed the nine-species benchmark dataset presented by DeepNovo. This dataset collects approximately 1.5 million mass spectra from nine distinct experiments, all utilizing the same instrument but analyzing peptides from different species. ... Casanovo adopts a transformer-based architecture to achieve de novo peptide sequencing. It utilizes a leave-one-out cross-validation framework, where training occurs on data from eight species and testing on the remaining ninth species, iterating this for each of the nine species. (A minimal sketch of this rotation appears after the table.) |
| Hardware Specification | Yes | During the training process, we used 8 A100 GPUs with a batch size of 4096. |
| Software Dependencies | No | Not found. The paper mentions the AdamW optimizer, but does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | In this study, we mapped all inputs into a 512-dimensional vector space, which includes peaks, peptides, and amino acids. The peptide embedding specifically comprises 256 dimensions for the index of each amino acid, and another 256 dimensions for the prefix sum and suffix sum of amino acids. The amino acid embedding includes 256 dimensions for the index and 256 dimensions for the mass of each amino acid. The peptide encoder, spectrum encoder, and peptide decoder each employ 9 attention layers. All our attention layers come with 1024 feed-forward dimensions. During the training process, we used 8 A100 GPUs with a batch size of 4096. As a result, this setup created a contrastive learning cosine similarity matrix of dimensions 512×512 on each GPU. We set the learning rate at 0.0004 and applied a linear warm-up. For gradient updates, we used the AdamW optimizer (Kingma and Ba 2014). (Illustrative sketches of the embedding layout and the contrastive objective follow the table.) |
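As a point of reference for the leave-one-out protocol quoted in the Dataset Splits row, here is a minimal Python sketch of the nine-species rotation: train on eight species, test on the held-out ninth, repeated for all nine. The `species_*` labels are hypothetical placeholders, not names from the paper.

```python
# Hypothetical sketch of nine-species leave-one-out cross-validation.
SPECIES = [f"species_{i}" for i in range(1, 10)]  # placeholder labels

def leave_one_out_splits(species):
    """Yield (train_species, test_species) pairs for each rotation."""
    for held_out in species:
        train = [s for s in species if s != held_out]
        yield train, held_out

for train_species, test_species in leave_one_out_splits(SPECIES):
    # In the benchmark, each species contributes its own set of PSMs.
    print(f"train on {len(train_species)} species, test on {test_species}")
```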
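The Experiment Setup row describes amino-acid embeddings that concatenate a 256-dimensional index embedding with a 256-dimensional mass encoding, for 512 dimensions total. Below is a minimal PyTorch sketch of that layout; the sinusoidal form of the mass encoding and the vocabulary size `NUM_AA` are assumptions, as the excerpt does not specify them.

```python
import torch
import torch.nn as nn

NUM_AA = 28            # 20 residues plus modifications/special tokens (assumed)
D_INDEX, D_MASS = 256, 256

class AminoAcidEmbedding(nn.Module):
    """Sketch: 256-d learned index embedding + 256-d fixed mass encoding."""

    def __init__(self, num_aa=NUM_AA):
        super().__init__()
        self.index_emb = nn.Embedding(num_aa, D_INDEX)
        # Sinusoidal frequencies, analogous to transformer positional encoding.
        freqs = torch.exp(
            torch.arange(0, D_MASS, 2).float()
            * (-torch.log(torch.tensor(10000.0)) / D_MASS)
        )
        self.register_buffer("freqs", freqs)

    def mass_encoding(self, mass):
        # Encode a scalar mass as interleaved sine/cosine features.
        angles = mass.unsqueeze(-1) * self.freqs               # (..., D_MASS/2)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

    def forward(self, aa_index, aa_mass):
        # Concatenate index and mass halves into a 512-d vector.
        return torch.cat(
            [self.index_emb(aa_index), self.mass_encoding(aa_mass)], dim=-1
        )

# Example: one amino acid with (hypothetical) index 3 and mass 71.03 Da.
emb = AminoAcidEmbedding()(torch.tensor([3]), torch.tensor([71.03]))
print(emb.shape)  # torch.Size([1, 512])
```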
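The same row mentions a per-GPU 512×512 contrastive cosine-similarity matrix (a global batch of 4096 over 8 GPUs gives 512 spectrum-peptide pairs per GPU). The sketch below shows a standard CLIP-style symmetric cross-entropy over such a matrix; the temperature value and the exact loss form are assumptions rather than details confirmed by the excerpt.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(spec_emb, pep_emb, temperature=0.07):
    """spec_emb, pep_emb: (B, D) embeddings of paired spectra and peptides."""
    spec = F.normalize(spec_emb, dim=-1)
    pep = F.normalize(pep_emb, dim=-1)
    logits = spec @ pep.T / temperature        # (B, B) cosine-similarity matrix
    targets = torch.arange(spec.size(0), device=spec.device)
    # Matched pairs lie on the diagonal; score each row and column
    # as a B-way classification over the in-batch candidates.
    loss_s2p = F.cross_entropy(logits, targets)
    loss_p2s = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_s2p + loss_p2s)

# Per-GPU batch of 512 with 512-d embeddings reproduces the 512x512 matrix
# mentioned in the setup.
loss = contrastive_loss(torch.randn(512, 512), torch.randn(512, 512))
```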