Enhancing Protein Mutation Effect Prediction through a Retrieval-Augmented Framework
Authors: Ruihan Guo, Rui Wang, Ruidong Wu, Zhizhou Ren, Jiahan Li, Shitong Luo, Zuofan Wu, Qiang Liu, Jian Peng, Jianzhu Ma
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings demonstrate that leveraging this method results in SOTA performance across multiple protein mutation prediction datasets and offers a scalable solution for studying mutation effects. |
| Researcher Affiliation | Collaboration | 1Helixon Research, 2The University of Texas at Austin, 3Institute for AI Industry Research, Tsinghua University |
| Pseudocode | Yes | Algorithm 1 MSM-IPA Information Fusion Algorithm |
| Open Source Code | Yes | The code is available at https://github.com/guoruihan/MSM-Mut |
| Open Datasets | Yes | We preprocess the entire Protein Data Bank (PDB) [Berman et al., 2003] and build a database we call the Structure Motif Embedding Database (SMEDB)... Our model is extensively evaluated on a suite of widely-used protein stability and binding affinity benchmarks, including S669 [Pancotti et al., 2022], cDNA [Tsuboyama et al., 2023], and SKEMPI [Jankauskaitė et al., 2019]... To demonstrate our model's robust generalization capability on new data, we tested it on a novel enzyme thermostability dataset provided by Novozymes [Pultz et al., 2022]. |
| Dataset Splits | Yes | The dataset is split into training, validation, and test sets in a ratio of 95%:0.5%:4.5%. For the PPI surface mutation effect prediction task, we perform 3-fold cross-validation on the SKEMPI dataset, partitioned by PDB ID. Two of these folds are further divided into training and validation sets in a 95:5 ratio based on PDB IDs, while the remaining fold is used as the test set. (The split logic is sketched after this table.) |
| Hardware Specification | Yes | As we exclusively use the encoder module of ESM-IF, the computational cost remains manageable, requiring approximately 3 days on 32 A100 GPUs. With a highly efficient CUDA implementation [Yoon, 2021], querying the top 10^5 neighbors of a result takes about 8 seconds on 8 A100 GPUs and only 0.5 seconds for the top 10^3 neighbors. (The query pattern is sketched after this table.) |
| Software Dependencies | No | The paper mentions 'ESM-IF' and 'CUHNSW' as software implementations but does not specify their version numbers. For instance: 'Our approach uses ESM-IF as a pretrain model and CUHNSW for vector database implementation.' |
| Experiment Setup | Yes | Our model training comprises two phases. In the pretraining phase, we utilize a meticulously curated dataset from the Protein Data Bank-REDO (PDB-REDO) [Joosten et al., 2014] as our pretraining data. The dataset is split into training, validation, and test sets in a ratio of 95%:0.5%:4.5%. The pretraining process involves an initial 200,000 steps without the inclusion of retrieved structure motifs, followed by an additional 30,000 steps incorporating retrieved structure motifs. During training, a random amino acid is selected, and its 256 nearest amino acids are extracted along with their amino acid types and backbone atom positions. (The extraction step is sketched after this table.) |
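
To make the grouped splits in the Dataset Splits row concrete, here is a minimal Python sketch of the 95%:0.5%:4.5% split and the 3-fold SKEMPI cross-validation partitioned by PDB ID. Function names, the shuffling seed, and the fold construction are illustrative assumptions, not taken from the released code.

```python
# Hypothetical sketch of the PDB-ID-grouped splitting described in the paper.
import random

def make_pretraining_split(pdb_ids, seed=0):
    """Split a list of PDB IDs 95% / 0.5% / 4.5% into train/val/test."""
    ids = sorted(set(pdb_ids))
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(0.95 * n)
    n_val = int(0.005 * n)
    return (ids[:n_train],                     # train
            ids[n_train:n_train + n_val],      # validation
            ids[n_train + n_val:])             # test

def make_skempi_folds(entries, n_folds=3, seed=0):
    """3-fold CV on SKEMPI, partitioned by PDB ID so no complex appears
    in more than one fold. `entries` maps PDB ID -> its mutations."""
    ids = sorted(entries)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_ids = folds[k]
        rest = [i for j, f in enumerate(folds) if j != k for i in f]
        n_train = int(0.95 * len(rest))        # 95:5 train/val split
        yield rest[:n_train], rest[n_train:], test_ids
```

Grouping by PDB ID before splitting is what prevents mutations of the same complex from leaking across train and test sets.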
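
The retrieval timing in the Hardware Specification row refers to CUHNSW, a CUDA implementation of HNSW search [Yoon, 2021]. As a hedged illustration of the query pattern only, the sketch below uses the CPU library hnswlib as a stand-in; the embedding dimension, index parameters, and database size are assumptions, not values from the paper.

```python
# CPU stand-in (hnswlib) for the paper's GPU HNSW retrieval (CUHNSW).
import numpy as np
import hnswlib

dim = 512                                    # assumed motif-embedding width
db = np.random.rand(100_000, dim).astype(np.float32)  # placeholder SMEDB vectors

index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=db.shape[0], ef_construction=200, M=16)
index.add_items(db)

k = 1_000                                    # paper reports ~0.5 s for top-10^3 on GPU
index.set_ef(max(2 * k, 100))                # ef must be >= k at query time
query = np.random.rand(1, dim).astype(np.float32)
labels, dists = index.knn_query(query, k=k)  # indices into SMEDB + L2 distances
```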
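
The local-environment extraction in the Experiment Setup row (a random residue plus its 256 nearest neighbors, with amino acid types and backbone atom positions) can be sketched as follows. Array shapes, the C-alpha distance metric, and the function name are assumptions for illustration.

```python
# Minimal sketch of the training-time local-environment extraction.
import numpy as np

def extract_local_environment(ca_coords, aa_types, backbone, k=256, rng=None):
    """ca_coords: (N, 3) C-alpha positions; aa_types: (N,) residue types;
    backbone: (N, 4, 3) N/CA/C/O atom positions. Returns the k residues
    closest to a randomly chosen center residue."""
    rng = rng or np.random.default_rng()
    center = rng.integers(len(ca_coords))
    d = np.linalg.norm(ca_coords - ca_coords[center], axis=-1)
    idx = np.argsort(d)[:k]                  # includes the center itself
    return aa_types[idx], backbone[idx]
```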