Enhancing Protein Mutation Effect Prediction through a Retrieval-Augmented Framework

Authors: Ruihan Guo, Rui Wang, Ruidong Wu, Zhizhou Ren, Jiahan Li, Shitong Luo, Zuofan Wu, Qiang Liu, Jian Peng, Jianzhu Ma

NeurIPS 2024

Reproducibility variables (each entry gives the variable, the result, and the supporting LLM response):
Research Type: Experimental. "Our findings demonstrate that leveraging this method results in SOTA performance across multiple protein mutation prediction datasets and offers a scalable solution for studying mutation effects."
Researcher Affiliation: Collaboration. (1) Helixon Research, (2) The University of Texas at Austin, (3) Institute for AI Industry Research, Tsinghua University.
Pseudocode: Yes. "Algorithm 1: MSM-IPA Information Fusion Algorithm."
Open Source Code: Yes. "The code is available at https://github.com/guoruihan/MSM-Mut."
Open Datasets: Yes. "We preprocess the entire Protein Data Bank (PDB) [Berman et al., 2003] and build a database, which we call the Structure Motif Embedding Database (SMEDB)... Our model is extensively evaluated on a suite of widely used protein stability and binding affinity benchmarks, including S669 [Pancotti et al., 2022], cDNA [Tsuboyama et al., 2023], and SKEMPI [Jankauskaitė et al., 2019]... To demonstrate our model's robust generalization capability on new data, we tested it on a novel enzyme thermostability dataset provided by Novozymes [Pultz et al., 2022]."
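For intuition, here is a minimal sketch of what an SMEDB-style build-and-query loop could look like. It is not the paper's implementation: the encoder is a hypothetical random projection standing in for the ESM-IF encoder, and the motif shape, embedding size, and function names are assumptions.

```python
# Minimal sketch of an SMEDB-style motif embedding database. The encoder
# is a hypothetical random projection standing in for ESM-IF; the motif
# shape (16 residues x 4 backbone atoms x 3 coords) is an assumption.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 128
PROJ = rng.standard_normal((16 * 4 * 3, EMB_DIM))  # fixed "encoder" weights

def encode_motif(backbone: np.ndarray) -> np.ndarray:
    """Map a motif's backbone coordinates (16, 4, 3) to a unit vector."""
    v = backbone.reshape(-1) @ PROJ
    return v / np.linalg.norm(v)

# Build the database: one embedding per motif extracted from structures.
motifs = [rng.standard_normal((16, 4, 3)) for _ in range(1_000)]  # toy data
db = np.stack([encode_motif(m) for m in motifs])                  # (1000, 128)

def query(motif: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k most similar motifs by cosine similarity."""
    scores = db @ encode_motif(motif)  # unit vectors, so dot = cosine
    return np.argsort(-scores)[:k]

print(query(motifs[3]))  # motif 3 should rank itself first
```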
Dataset Splits: Yes. "The dataset is split into training, validation, and test sets in a ratio of 95%:0.5%:4.5%. For the PPI surface mutation effect prediction task, we perform 3-fold cross-validation on the SKEMPI dataset, partitioned by PDB ID. Two of these folds were further divided into training and validation sets in a 95:5 ratio based on PDB IDs, while the remaining fold was used as the test set."
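A small sketch of the described SKEMPI protocol (3-fold CV grouped by PDB ID, with a 95:5 train/validation split of the other two folds) might look like the following; the entry format and field names are placeholders, not the authors' code.

```python
# Sketch of a 3-fold CV split partitioned by PDB ID, with the two
# non-test folds further split 95:5 into train/validation by PDB ID.
import random
from collections import defaultdict

def three_fold_by_pdbid(entries, seed=0):
    """entries: list of dicts, each with a (hypothetical) 'pdb_id' key."""
    by_pdb = defaultdict(list)
    for e in entries:
        by_pdb[e["pdb_id"]].append(e)
    pdb_ids = sorted(by_pdb)
    random.Random(seed).shuffle(pdb_ids)
    folds = [pdb_ids[i::3] for i in range(3)]  # disjoint PDB-ID folds
    for test_i in range(3):
        test = [e for pid in folds[test_i] for e in by_pdb[pid]]
        rest = [pid for i in range(3) if i != test_i for pid in folds[i]]
        cut = int(0.95 * len(rest))  # 95:5 train/val, grouped by PDB ID
        train = [e for pid in rest[:cut] for e in by_pdb[pid]]
        val = [e for pid in rest[cut:] for e in by_pdb[pid]]
        yield train, val, test
```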
Hardware Specification: Yes. "As we exclusively use the encoder module of ESM-IF, the computational cost remains manageable, requiring approximately 3 days on 32 A100 GPUs. With a highly efficient CUDA implementation [Yoon, 2021], querying the top 10^5 neighbors takes about 8 seconds on 8 A100 GPUs, and only 0.5 seconds for the top 10^3 neighbors."
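CUHNSW [Yoon, 2021] is a CUDA implementation of HNSW approximate nearest-neighbor search. As a CPU-side illustration of the same kind of query, the sketch below uses the hnswlib library instead (an assumption, not the paper's pipeline), retrieving the top 10^3 neighbors of a query vector.

```python
# Illustrative HNSW nearest-neighbor retrieval with hnswlib as a CPU
# stand-in for CUHNSW; dimensions and database size are assumptions.
import numpy as np
import hnswlib

dim, n = 128, 100_000
vectors = np.random.default_rng(0).standard_normal((n, dim)).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))

k = 1_000                # top 10^3 neighbors, as in the timing quote above
index.set_ef(2 * k)      # search-time ef must be >= k
labels, distances = index.knn_query(vectors[:1], k=k)
print(labels.shape)      # (1, 1000)
```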
Software Dependencies: No. The paper mentions ESM-IF and CUHNSW as software implementations but does not specify their version numbers. For instance: "Our approach uses ESM-IF as a pretrained model and CUHNSW for the vector database implementation."
Experiment Setup: Yes. "Our model training comprises two phases. In the pretraining phase, we utilize a meticulously curated dataset from the Protein Data Bank-REDO (PDB-REDO) [Joosten et al., 2014] for our pretraining data. The dataset is split into training, validation, and test sets in a ratio of 95%:0.5%:4.5%. The pretraining process involves an initial 200,000 steps without the inclusion of retrieved structure motifs, followed by an additional 30,000 steps incorporating retrieved structure motifs. During training, a random amino acid is selected, and its 256 nearest amino acids are extracted with their amino acid types and backbone atom positions."
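The local-region construction in the last sentence could be sketched as follows; selecting neighbors by C-alpha distance, and the array shapes, are assumptions rather than details confirmed by the excerpt.

```python
# Sketch of the training-example construction: pick a random residue and
# gather its 256 spatially nearest residues (by C-alpha distance, an
# assumption) together with their types and backbone atom positions.
import numpy as np

def sample_local_region(ca_coords, aa_types, backbone, k=256, rng=None):
    """ca_coords: (N, 3); aa_types: (N,); backbone: (N, 4, 3) for N, CA, C, O."""
    rng = rng or np.random.default_rng()
    center = rng.integers(len(ca_coords))              # random anchor residue
    d = np.linalg.norm(ca_coords - ca_coords[center], axis=-1)
    idx = np.argsort(d)[:k]                            # includes the anchor
    return aa_types[idx], backbone[idx]
```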