MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering

Authors: Yizhen Luo, Zikun Nie, Massimo Hong, Suyuan Zhao, Hao Zhou, Zaiqing Nie

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments, we demonstrate that MutaPLM excels at providing human-understandable explanations for mutational effects and prioritizing novel mutations with desirable properties.
Researcher Affiliation | Collaboration | 1 Institute of AI Industry Research (AIR), Tsinghua University; 2 Department of Computer Science and Technology, Tsinghua University; 3 Pharmolix Inc.
Pseudocode | Yes | Algorithm 1: Multi-round Optimization with Beam Search. (A hedged sketch of this procedure appears after the table.)
Open Source Code | Yes | Our code, model, and data are open-sourced at https://github.com/PharMolix/MutaPLM.
Open Datasets | Yes | We also construct MutaDescribe, the first large-scale protein mutation dataset with rich textual annotations, which provides cross-modal supervision signals. We build MutaDescribe, a large-scale dataset comprising 20.9K wild-type proteins and 171.1K single-site mutations, to facilitate fine-tuning and evaluation. The primary source of MutaDescribe is UniProtKB/Swiss-Prot [69], a widely adopted protein database that contains 106.6K single-site substitutions. We collect expert-reviewed descriptions of mutational effects from the Phenotypes & Variants entry and retrieve the abstract of the corresponding publications on PubMed [70] based on available reference information.
Dataset Splits | Yes | We first randomly split our dataset into training, validation, and test sets. To evaluate models' generalization capabilities on novel proteins, we further partition the test set into three subsets based on the wild-type sequence homology with training sequences. We adopt MMseqs2 [71], a widely adopted tool, to calculate sequence homology. The Easy, Medium, and Hard test subsets comprise samples whose sequence homology falls within [0.95, 1], [0.5, 0.95), and [0, 0.5), respectively. We also implement a temporal split based on the publication date of the mutation, and we defer readers to Appendix B for details and Appendix D.1 for evaluation results. (A partitioning sketch appears after the table.)
Hardware Specification | Yes | The overall training process takes 10 days on 4 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions specific models and tools used (e.g., ESM-2 (650M), BioMedGPT-LM, LLaMA2-7B, GPT-3.5-turbo, GPT-4, MMseqs2) and their references, but it does not specify programming language versions (e.g., Python 3.x) or library versions (e.g., PyTorch 1.x) for its ancillary software dependencies.
Experiment Setup | Yes | We apply low-rank adaptation (LoRA) [77] on the LLM with a rank of 16. The number of query embeds and soft tokens is set as K = 32. We optimize the LoRA modules, the wild-type encoder, the delta encoder, the delta decoder, the soft tokens, the position head, and the language modeling (LM) head, which comprises a total of 75.0M parameters. The remaining 7.4B parameters are kept frozen. We pre-train MutaPLM for 200K steps with a batch size of 32 on 1.1M protein-text data collected from biomedical publications (detailed in Appendix B.1) and fine-tune it for 70K steps with a batch size of 24 on MutaDescribe. For both stages, we use the AdamW optimizer [78] with a learning rate that is linearly warmed up to 10^-4 for the first 1K steps and decreases to 10^-5 following a cosine annealing strategy. (A hedged configuration sketch appears after the table.)
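
On the pseudocode row: the paper's Algorithm 1 (Multi-round Optimization with Beam Search) is not reproduced here, so the following is only a minimal, generic sketch of multi-round single-site mutation optimization with beam search. The `score_mutant` callable and `propose_mutations` helper are hypothetical stand-ins for MutaPLM's fitness scoring and mutation-proposal steps; the round count and beam width are illustrative defaults, not values from the paper.

```python
# Hedged sketch: multi-round single-site mutation optimization with beam search.
# `score_mutant` is a hypothetical stand-in for MutaPLM's scoring of a candidate.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_mutations(sequence):
    """Enumerate all single-site substitutions of `sequence`."""
    for pos, wt_aa in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa != wt_aa:
                yield sequence[:pos] + aa + sequence[pos + 1:]

def beam_search_optimize(wild_type, score_mutant, rounds=3, beam_width=4):
    """Keep the top `beam_width` candidates each round, re-mutating each of them."""
    beam = [(score_mutant(wild_type), wild_type)]
    for _ in range(rounds):
        candidates = []
        for _, seq in beam:
            for mutant in propose_mutations(seq):
                candidates.append((score_mutant(mutant), mutant))
        # Retain only the highest-scoring candidates for the next round.
        beam = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]
    return beam
```

After `rounds` iterations, each surviving candidate differs from the wild type by up to `rounds` substitutions, which is the usual motivation for running beam search over multiple rounds rather than a single exhaustive scan.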
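
On the dataset-splits row: the homology-based partition of the test set can be summarized with a short sketch. It assumes a precomputed mapping from each test wild-type protein to its maximum sequence identity against the training set (e.g., parsed from MMseqs2 search output, which is an assumption about the pipeline); only the subset names and thresholds come from the paper.

```python
# Hedged sketch: bucket test proteins by maximum sequence identity to the training set.
# Thresholds and subset names follow the paper; the input mapping is assumed.

def assign_subset(max_identity):
    if max_identity >= 0.95:
        return "Easy"      # homology in [0.95, 1]
    if max_identity >= 0.5:
        return "Medium"    # homology in [0.5, 0.95)
    return "Hard"          # homology in [0, 0.5)

def partition_test_set(test_homology):
    """test_homology: dict mapping test protein ID -> max identity vs. training sequences."""
    subsets = {"Easy": [], "Medium": [], "Hard": []}
    for protein_id, identity in test_homology.items():
        subsets[assign_subset(identity)].append(protein_id)
    return subsets
```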
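
On the experiment-setup row: below is a minimal sketch of the described optimization configuration, assuming PyTorch and the PEFT library (the paper specifies neither versions nor these APIs). The LoRA rank (16), peak learning rate (1e-4), floor learning rate (1e-5), and 1K warmup steps follow the paper; `lora_alpha`, `target_modules`, dropout, and the warmup start factor are illustrative assumptions.

```python
# Hedged sketch: LoRA adaptation plus AdamW with linear warmup and cosine annealing.
import torch
from peft import LoraConfig, get_peft_model

def build_trainable(llm, total_steps, warmup_steps=1_000):
    lora_cfg = LoraConfig(
        r=16,                                 # LoRA rank, as reported in the paper
        lora_alpha=32,                        # assumption: not reported
        target_modules=["q_proj", "v_proj"],  # assumption: a common LLaMA-2 choice
        lora_dropout=0.05,                    # assumption: not reported
    )
    model = get_peft_model(llm, lora_cfg)

    # AdamW over trainable parameters only; peak LR 1e-4 as in the paper.
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )
    # Linear warmup to the peak LR over the first 1K steps,
    # then cosine annealing down to 1e-5 for the remaining steps.
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps
    )
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_steps - warmup_steps, eta_min=1e-5
    )
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps]
    )
    return model, optimizer, scheduler
```

Under this sketch, `total_steps` would be 200K for pre-training and 70K for fine-tuning, matching the step counts quoted in the table.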