SurfPro: Functional Protein Design Based on Continuous Surface

Authors: Zhenqiao Song, Tinglin Huang, Lei Li, Wengong Jin

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SurfPro on a standard inverse folding benchmark CATH 4.2 and two functional protein design tasks: protein binder design and enzyme design. Our SurfPro consistently surpasses previous state-of-the-art inverse folding methods, achieving a recovery rate of 57.78% on CATH 4.2 and higher success rates in terms of protein-protein binding and enzyme-substrate interaction scores.
Researcher Affiliation | Academia | 1Language Technologies Institute, Carnegie Mellon University, Pittsburgh, United States. 2Yale University, New Haven, United States. 3Broad Institute of MIT and Harvard, Boston, United States.
Pseudocode | No | The paper describes the method using textual descriptions and mathematical equations but does not include explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using the released code of baseline models ('We use all their released codes on GitHub') but does not state that the code for SurfPro itself is open-source or provide a link to it.
Open Datasets | Yes | Following previous work (Dauparas et al., 2022; Gao et al., 2022), we use the CATH 4.2 dataset curated by Ingraham et al. (2019) and follow the same data splits of Jing et al. (2020). We collect experimentally confirmed positive complexes of <binder, target protein> pairs across six categories from Bennett et al. (2023). We collect five categories of enzymes from Kroll et al. (2023a), each of which binds to a specific substrate.
Dataset Splits | Yes | As a consequence, the training, validation, and test splits consist of 14525, 468, and 887 samples, respectively. For categories with over 50 complexes, we employ an 8 : 1 : 1 random split for training, validation, and test sets; otherwise, all complexes are included in the test set, establishing a zero-shot scenario. For enzyme categories containing more than 100 samples, we randomly split the data into training, validation, and test sets using an 8 : 1 : 1 ratio after clustering; otherwise, all data are taken as the test set.
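The 8 : 1 : 1 random split described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `split_8_1_1` and the fixed seed are assumptions for reproducibility of the example.

```python
import random

def split_8_1_1(samples, seed=0):
    """Randomly partition samples into 8:1:1 train/valid/test subsets.

    Illustrative only: the paper's enzyme splits are additionally applied
    after sequence clustering, which is not shown here.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_valid = int(n * 0.1)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_8_1_1(range(1000))
print(len(train), len(valid), len(test))  # 800 100 100
```

Categories below the size threshold skip this split entirely and go to the test set, which is what creates the zero-shot evaluation scenario.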
Hardware Specification | Yes | The model, trained with one NVIDIA RTX A6000 GPU card, utilizes the Adam optimizer (Kingma & Ba, 2014).
Software Dependencies | No | The paper mentions software like MSMS, AlphaFold2, ESMFold, and the Adam optimizer but does not provide specific version numbers for any of these dependencies, which are necessary for reproducibility.
Experiment Setup | Yes | We set a maximum limit of 5,000 vertices for each surface. Surfaces with fewer than 5,000 vertices remain unchanged, while those exceeding this limit are compressed with a down-sampling ratio r set to 5,000/N, where N denotes the original vertex count. The minimum vertex number in a cube Nmin in surface compression is set to 32. Local perspective modeling utilizes three layers, and global landscape modeling employs a two-layer FAMHA. The two biochemical features are mapped to a hidden space with a dimensionality of 256. The autoregressive decoder is built with a 3-layer Transformer decoder. The mini-batch size and learning rate are set to 4,096 tokens and 5e-4, respectively.
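The vertex-cap rule quoted above (r = 5,000/N for surfaces exceeding the limit) can be expressed as a short sketch. The function names here are hypothetical, and the cube-based compression step with Nmin = 32 is not reproduced; only the ratio arithmetic from the quote is shown.

```python
def downsample_ratio(num_vertices, max_vertices=5000):
    """Down-sampling ratio r = max_vertices / N for oversized surfaces.

    Surfaces at or under the cap keep all vertices (r = 1.0), matching
    the rule described in the experiment setup.
    """
    if num_vertices <= max_vertices:
        return 1.0
    return max_vertices / num_vertices

def target_vertex_count(num_vertices, max_vertices=5000):
    """Vertex count after applying the ratio (illustrative helper)."""
    r = downsample_ratio(num_vertices, max_vertices)
    return round(num_vertices * r)

print(downsample_ratio(4000))        # 1.0  (unchanged)
print(downsample_ratio(10000))       # 0.5
print(target_vertex_count(10000))    # 5000
```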