Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Enhancing Protein Mutation Effect Prediction through a Retrieval-Augmented Framework
Authors: Ruihan Guo, Rui Wang, Ruidong Wu, Zhizhou Ren, Jiahan Li, Shitong Luo, Zuofan Wu, Qiang Liu, Jian Peng, Jianzhu Ma
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings demonstrate that leveraging this method results in the SOTA performance across multiple protein mutation prediction datasets, and offers a scalable solution for studying mutation effects. |
| Researcher Affiliation | Collaboration | 1 Helixon Research, 2 The University of Texas at Austin, 3 Institute for AI Industry Research, Tsinghua University |
| Pseudocode | Yes | Algorithm 1 MSM-IPA Information Fusion Algorithm |
| Open Source Code | Yes | The code is available at https://github.com/guoruihan/MSM-Mut |
| Open Datasets | Yes | We preprocess the entire Protein Data Bank (PDB) [Berman et al., 2003] and build a database, we call Structure Motif Embedding Database (SMEDB)... Our model is extensively evaluated on a suite of widely-used protein stability and binding affinity benchmarks, including S669 [Pancotti et al., 2022], cDNA [Tsuboyama et al., 2023], and SKEMPI [Jankauskaitė et al., 2019]... To demonstrate our model's robust generalization capability on new data, we tested it on a novel enzyme thermostability dataset provided by Novozymes [Pultz et al., 2022]. |
| Dataset Splits | Yes | The dataset is split into training, validation, and test sets in a ratio of 95%:0.5%:4.5%. For the PPI surface mutation effect prediction task, we perform 3-fold cross-validation on the SKEMPI dataset, partitioned by PDBID. Two of these folds were further divided into training and validation sets in a 95:5 ratio based on PDB IDs, while the remaining fold was used as the test set. |
| Hardware Specification | Yes | As we exclusively use the encoder module of ESM-IF, the computational cost remains manageable, requiring approximately 3 days on 32 A100 GPUs. With a highly efficient CUDA implementation [Yoon, 2021] the time consumption querying the top 10^5 neighbors of a result is about 8 seconds in 8 A100 GPUs and only 0.5 seconds for the top 10^3 neighbors. |
| Software Dependencies | No | The paper mentions 'ESM-IF' and 'CUHNSW' as software implementations but does not specify their version numbers. For instance: 'Our approach uses ESM-IF as a pretrain model and CUHNSW for vector database implementation.' |
| Experiment Setup | Yes | Our model training comprises two phases. In the pretraining phase, we utilize a meticulously curated dataset from the Protein Data Bank-REDO (PDB-REDO) [Joosten et al., 2014] for our pretraining data. The dataset is split into training, validation, and test sets in a ratio of 95%:0.5%:4.5%. The pretraining process involves an initial 200,000 steps without the inclusion of retrieved structure motifs, followed by an additional 30,000 steps incorporating retrieved structure motifs. During training, a random amino acid was selected, and its 256 nearest amino acids are extracted with their amino acid types and backbone atom positions. |
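
The Dataset Splits row describes 3-fold cross-validation on SKEMPI partitioned by PDB ID, with the two non-test folds further divided 95:5 into training and validation sets. The following is a minimal sketch of such a grouped split using only the Python standard library. It is not taken from the authors' repository: the function name `grouped_three_fold`, its parameters, the seed, and the round-robin fold assignment are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def grouped_three_fold(pdb_ids, n_folds=3, val_fraction=0.05, seed=0):
    """Yield (train, val, test) index lists; each PDB ID stays in one subset.

    Hypothetical helper for illustration, not the paper's implementation.
    """
    # Collect the sample indices that belong to each PDB ID.
    groups = defaultdict(list)
    for idx, pdb_id in enumerate(pdb_ids):
        groups[pdb_id].append(idx)

    # Shuffle the unique PDB IDs and deal them round-robin into folds.
    unique_ids = sorted(groups)
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    folds = [unique_ids[i::n_folds] for i in range(n_folds)]

    def expand(ids):
        # Map a list of PDB IDs back to the sample indices they contain.
        return [i for g in ids for i in groups[g]]

    for test_fold in range(n_folds):
        test_ids = folds[test_fold]
        # Pool the remaining folds, then carve out ~5% of PDB IDs for validation.
        rest = [g for i, fold in enumerate(folds) if i != test_fold for g in fold]
        rng.shuffle(rest)
        n_val = max(1, round(val_fraction * len(rest)))
        val_ids, train_ids = rest[:n_val], rest[n_val:]
        yield expand(train_ids), expand(val_ids), expand(test_ids)
```

Splitting on PDB IDs rather than on individual mutations keeps all mutations of one complex in the same subset, which is the property the quoted passage emphasizes. Note that the 95:5 ratio here applies to the number of PDB IDs, so the ratio of individual samples may differ when complexes contain unequal numbers of mutations.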