Full-Atom Peptide Design based on Multi-modal Flow Matching

Authors: Jiahan Li, Chaoran Cheng, Zuofan Wu, Ruihan Guo, Shitong Luo, Zhizhou Ren, Jian Peng, Jianzhu Ma

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct a comprehensive evaluation of PepFlow across three tasks: (1) Sequence-Structure Co-design, (2) Fix-Backbone Sequence Design, and (3) Side-Chain Packing. We introduce a new benchmark dataset derived from PepBDB (Wen et al., 2019) and Q-BioLiP (Wei et al., 2024). After removing duplicate entries and applying empirical criteria (e.g. resolution < 4 Å, peptide length between 3 and 25), we cluster these complexes according to 40% peptide sequence identity using MMseqs2 (Steinegger & Söding, 2017). This results in 8,365 non-orphan complexes distributed across 292 clusters. To construct the test set, we randomly select 10 clusters containing 158 complexes, while the remaining complexes are used for training and validation. We implement and compare the performance of three variants of our model: PepFlow w/Bb only samples backbones; PepFlow w/Bb+Seq models backbones and sequences jointly; and PepFlow w/Bb+Seq+Ang models the full-atom distribution of peptides. Experimental details and additional results are provided in Appendix B. (A hypothetical dataset-construction sketch is given after this table.)
Researcher Affiliation | Collaboration | (1) Helixon Research; (2) Institute for AI Industry Research, Tsinghua University; (3) Department of Computer Science, University of Illinois Urbana-Champaign.
Pseudocode | Yes | Algorithm 1: Training Multi-Modal PepFlow; Algorithm 2: Sampling with Multi-Modal PepFlow.
Open Source Code | Yes | Code and data are available at https://github.com/Ced3-han/PepFlowww.
Open Datasets | Yes | We introduce a new benchmark dataset derived from PepBDB (Wen et al., 2019) and Q-BioLiP (Wei et al., 2024).
Dataset Splits | No | The paper states: "To construct the test set, we randomly select 10 clusters containing 158 complexes, while the remaining complexes are used for training and validation." However, it does not specify the exact split percentage or count between the training and validation sets within the remaining complexes.
Hardware Specification | Yes | All three models are trained on 8 NVIDIA A100 GPUs using a DDP distributed training scheme for 40k iterations. Sampling is executed on a single NVIDIA A100, employing 200 equal-spaced timesteps for the Euler step update and simultaneously sampling 64 peptides for each test case. (Hedged training and sampling sketches follow the table.)
Software Dependencies | No | The paper mentions software tools such as PyRosetta, Rosetta, DSSP, MMseqs2, ESMFold, and ProtGPT2, but does not specify their version numbers, which limits reproducibility.
Experiment Setup | Yes | We set the learning rate at 5e-4 and the batch size at 32 for each distributed node. We execute the sampling process of our model on a single NVIDIA A100, employing 200 equal-spaced timesteps for the Euler step update and simultaneously sampling 64 peptides for each test case. (These hyperparameters appear in the training sketch below.)
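The dataset-construction procedure quoted in the Research Type and Dataset Splits rows (filter by resolution and peptide length, cluster peptides at 40% sequence identity with MMseqs2, hold out whole clusters for testing) can be sketched as follows. This is a hypothetical reconstruction, not the PepFlow pipeline: the `build_splits` helper, the input record fields, and the file names are assumptions, and only the thresholds come from the paper.

```python
# Hypothetical sketch of the dataset construction: filter complexes, cluster
# peptide sequences with MMseqs2 at 40% identity, and hold out whole clusters
# as the test set. Record fields and file names are illustrative.
import random
import subprocess

def build_splits(complexes, fasta_path="peptides.fasta", n_test_clusters=10, seed=0):
    # Keep complexes with resolution < 4 Å and peptide length between 3 and 25.
    kept = [c for c in complexes
            if c["resolution"] < 4.0 and 3 <= len(c["peptide_seq"]) <= 25]

    # Write peptide sequences and cluster them at 40% sequence identity.
    with open(fasta_path, "w") as f:
        for c in kept:
            f.write(f">{c['id']}\n{c['peptide_seq']}\n")
    subprocess.run(["mmseqs", "easy-cluster", fasta_path, "clu", "tmp",
                    "--min-seq-id", "0.4"], check=True)

    # clu_cluster.tsv maps each member sequence to its cluster representative.
    cluster_of = {}
    with open("clu_cluster.tsv") as f:
        for line in f:
            rep, member = line.split()
            cluster_of[member] = rep

    # Sample whole clusters for the test set; the rest is train/validation.
    clusters = sorted(set(cluster_of.values()))
    random.seed(seed)
    test_clusters = set(random.sample(clusters, n_test_clusters))
    test = [c for c in kept if cluster_of[c["id"]] in test_clusters]
    train_val = [c for c in kept if cluster_of[c["id"]] not in test_clusters]
    return train_val, test
```

Splitting at the cluster level, rather than per complex, is what keeps test peptides from sharing more than 40% sequence identity with the training set.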
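The Hardware Specification and Experiment Setup rows report 8 A100 GPUs with DDP, 40k iterations, a learning rate of 5e-4, and a batch size of 32 per distributed node. A minimal PyTorch DDP loop with those settings might look like the sketch below; the model, dataset, and loss are placeholders, and this is not the PepFlow training script.

```python
# Illustrative DDP training loop with the reported hyperparameters.
# Launch with: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, max_iters=40_000, lr=5e-4, batch_size=32):
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = DDP(model.cuda(rank), device_ids=[rank])
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    step, epoch = 0, 0
    while step < max_iters:
        sampler.set_epoch(epoch)
        for batch in loader:
            loss = model(batch)          # placeholder for the flow-matching loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= max_iters:
                break
        epoch += 1
    dist.destroy_process_group()
```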
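Sampling with "200 equal-spaced timesteps for the Euler step update" corresponds to forward Euler integration of a learned velocity field from t = 0 to t = 1. The sketch below assumes a single-tensor state and a generic `model(x, t)` network; PepFlow's actual sampler updates backbone frames, residue types, and side-chain torsion angles jointly on their respective manifolds, which is not reproduced here.

```python
# Minimal sketch of Euler-step sampling for a flow-matching model.
import torch

@torch.no_grad()
def euler_sample(model, x0, num_steps=200):
    """Integrate dx/dt = v_theta(x, t) from t=0 to t=1 with equal-spaced steps."""
    x = x0                        # e.g. a batch of 64 noise samples per test case
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * model(x, t)  # forward Euler update along the learned flow
    return x
```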