Enhancing Protein Mutation Effect Prediction through a Retrieval-Augmented Framework

Authors: Ruihan Guo, Rui Wang, Ruidong Wu, Zhizhou Ren, Jiahan Li, Shitong Luo, Zuofan Wu, Qiang Liu, Jian Peng, Jianzhu Ma

NeurIPS 2024

Reproducibility variables (each entry gives the variable, the result, and the supporting LLM response):
Research Type: Experimental. "Our findings demonstrate that leveraging this method results in SOTA performance across multiple protein mutation prediction datasets and offers a scalable solution for studying mutation effects."
Researcher Affiliation: Collaboration. (1) Helixon Research, (2) The University of Texas at Austin, (3) Institute for AI Industry Research, Tsinghua University.
Pseudocode: Yes. "Algorithm 1: MSM-IPA Information Fusion Algorithm."
Open Source Code: Yes. "The code is available at https://github.com/guoruihan/MSM-Mut."
Open Datasets: Yes. "We preprocess the entire Protein Data Bank (PDB) [Berman et al., 2003] and build a database, which we call the Structure Motif Embedding Database (SMEDB)... Our model is extensively evaluated on a suite of widely used protein stability and binding affinity benchmarks, including S669 [Pancotti et al., 2022], cDNA [Tsuboyama et al., 2023], and SKEMPI [Jankauskaitė et al., 2019]... To demonstrate our model's robust generalization capability on new data, we tested it on a novel enzyme thermostability dataset provided by Novozymes [Pultz et al., 2022]."
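For intuition, here is a minimal sketch of what an SMEDB-style build-and-query loop could look like. It is not the paper's implementation: the encoder is a hypothetical random projection standing in for the ESM-IF encoder, and the motif shape, embedding size, and function names are assumptions.

```python
# Minimal sketch of an SMEDB-style motif embedding database. The encoder
# is a hypothetical random projection standing in for ESM-IF; the motif
# shape (16 residues x 4 backbone atoms x 3 coords) is an assumption.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 128
PROJ = rng.standard_normal((16 * 4 * 3, EMB_DIM))  # fixed "encoder" weights

def encode_motif(backbone: np.ndarray) -> np.ndarray:
    """Map a motif's backbone coordinates (16, 4, 3) to a unit vector."""
    v = backbone.reshape(-1) @ PROJ
    return v / np.linalg.norm(v)

# Build the database: one embedding per motif extracted from structures.
motifs = [rng.standard_normal((16, 4, 3)) for _ in range(1_000)]  # toy data
db = np.stack([encode_motif(m) for m in motifs])                  # (1000, 128)

def query(motif: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k most similar motifs by cosine similarity."""
    scores = db @ encode_motif(motif)  # unit vectors, so dot = cosine
    return np.argsort(-scores)[:k]

print(query(motifs[3]))  # motif 3 should rank itself first
```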
Dataset Splits: Yes. "The dataset is split into training, validation, and test sets in a ratio of 95%:0.5%:4.5%. For the PPI surface mutation effect prediction task, we perform 3-fold cross-validation on the SKEMPI dataset, partitioned by PDB ID. Two of these folds were further divided into training and validation sets in a 95:5 ratio based on PDB IDs, while the remaining fold was used as the test set."
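A small sketch of the described SKEMPI protocol (3-fold CV grouped by PDB ID, with a 95:5 train/validation split of the other two folds) might look like the following; the entry format and field names are placeholders, not the authors' code.

```python
# Sketch of a 3-fold CV split partitioned by PDB ID, with the two
# non-test folds further split 95:5 into train/validation by PDB ID.
import random
from collections import defaultdict

def three_fold_by_pdbid(entries, seed=0):
    """entries: list of dicts, each with a (hypothetical) 'pdb_id' key."""
    by_pdb = defaultdict(list)
    for e in entries:
        by_pdb[e["pdb_id"]].append(e)
    pdb_ids = sorted(by_pdb)
    random.Random(seed).shuffle(pdb_ids)
    folds = [pdb_ids[i::3] for i in range(3)]  # disjoint PDB-ID folds
    for test_i in range(3):
        test = [e for pid in folds[test_i] for e in by_pdb[pid]]
        rest = [pid for i in range(3) if i != test_i for pid in folds[i]]
        cut = int(0.95 * len(rest))  # 95:5 train/val, grouped by PDB ID
        train = [e for pid in rest[:cut] for e in by_pdb[pid]]
        val = [e for pid in rest[cut:] for e in by_pdb[pid]]
        yield train, val, test
```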
Hardware Specification: Yes. "As we exclusively use the encoder module of ESM-IF, the computational cost remains manageable, requiring approximately 3 days on 32 A100 GPUs. With a highly efficient CUDA implementation [Yoon, 2021], querying the top 10^5 neighbors takes about 8 seconds on 8 A100 GPUs, and only 0.5 seconds for the top 10^3 neighbors."
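CUHNSW [Yoon, 2021] is a CUDA implementation of HNSW approximate nearest-neighbor search. As a CPU-side illustration of the same kind of query, the sketch below uses the hnswlib library instead (an assumption, not the paper's pipeline), retrieving the top 10^3 neighbors of a query vector.

```python
# Illustrative HNSW nearest-neighbor retrieval with hnswlib as a CPU
# stand-in for CUHNSW; dimensions and database size are assumptions.
import numpy as np
import hnswlib

dim, n = 128, 100_000
vectors = np.random.default_rng(0).standard_normal((n, dim)).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))

k = 1_000                # top 10^3 neighbors, as in the timing quote above
index.set_ef(2 * k)      # search-time ef must be >= k
labels, distances = index.knn_query(vectors[:1], k=k)
print(labels.shape)      # (1, 1000)
```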
Software Dependencies: No. The paper mentions ESM-IF and CUHNSW as software implementations but does not specify their version numbers. For instance: "Our approach uses ESM-IF as a pretrained model and CUHNSW for the vector database implementation."
Experiment Setup: Yes. "Our model training comprises two phases. In the pretraining phase, we utilize a meticulously curated dataset from the Protein Data Bank-REDO (PDB-REDO) [Joosten et al., 2014] for our pretraining data. The dataset is split into training, validation, and test sets in a ratio of 95%:0.5%:4.5%. The pretraining process involves an initial 200,000 steps without the inclusion of retrieved structure motifs, followed by an additional 30,000 steps incorporating retrieved structure motifs. During training, a random amino acid is selected, and its 256 nearest amino acids are extracted with their amino acid types and backbone atom positions."
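The local-region construction in the last sentence could be sketched as follows; selecting neighbors by C-alpha distance, and the array shapes, are assumptions rather than details confirmed by the excerpt.

```python
# Sketch of the training-example construction: pick a random residue and
# gather its 256 spatially nearest residues (by C-alpha distance, an
# assumption) together with their types and backbone atom positions.
import numpy as np

def sample_local_region(ca_coords, aa_types, backbone, k=256, rng=None):
    """ca_coords: (N, 3); aa_types: (N,); backbone: (N, 4, 3) for N, CA, C, O."""
    rng = rng or np.random.default_rng()
    center = rng.integers(len(ca_coords))              # random anchor residue
    d = np.linalg.norm(ca_coords - ca_coords[center], axis=-1)
    idx = np.argsort(d)[:k]                            # includes the anchor
    return aa_types[idx], backbone[idx]
```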