SurfPro: Functional Protein Design Based on Continuous Surface
Authors: Zhenqiao Song, Tinglin Huang, Lei Li, Wengong Jin
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SurfPro on a standard inverse folding benchmark CATH 4.2 and two functional protein design tasks: protein binder design and enzyme design. Our SurfPro consistently surpasses previous state-of-the-art inverse folding methods, achieving a recovery rate of 57.78% on CATH 4.2 and higher success rates in terms of protein-protein binding and enzyme-substrate interaction scores. |
| Researcher Affiliation | Academia | 1Language Technologies Institute, Carnegie Mellon University, Pittsburgh, United States. 2Yale University, New Haven, United States. 3Broad Institute of MIT and Harvard, Boston, United States. |
| Pseudocode | No | The paper describes the method using textual descriptions and mathematical equations but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using the released codes of baseline models ('We use all their released codes on GitHub') but does not state that the code for SurfPro itself is open-source or provide a link to it. |
| Open Datasets | Yes | Following previous work (Dauparas et al., 2022; Gao et al., 2022), we use the CATH 4.2 dataset curated by Ingraham et al. (2019) and follow the same data splits of Jing et al. (2020). We collect experimentally confirmed positive complexes of <binder, target protein> pairs across six categories from Bennett et al. (2023). We collect five categories of enzymes from Kroll et al. (2023a), each of which binds to a specific substrate. |
| Dataset Splits | Yes | As a consequence, the training, validation, and test splits consist of 14525, 468, and 887 samples, respectively. For categories with over 50 complexes, we employ an 8 : 1 : 1 random split for training, validation, and test sets; otherwise, all complexes are included in the test set, establishing a zero-shot scenario. For enzyme categories containing more than 100 samples, we randomly split the data into training, validation, and test sets using an 8 : 1 : 1 ratio after clustering; otherwise, all data are taken as the test set. |
| Hardware Specification | Yes | The model, trained with one NVIDIA RTX A6000 GPU card, utilizes the Adam optimizer (Kingma & Ba, 2014). |
| Software Dependencies | No | The paper mentions software like MSMS, AlphaFold2, ESMFold, and Adam optimizer but does not provide specific version numbers for any of these dependencies, which are necessary for reproducibility. |
| Experiment Setup | Yes | We set a maximum limit of 5,000 vertices for each surface. Surfaces with fewer than 5,000 vertices remain unchanged, while those exceeding this limit are compressed with a down-sampling ratio r set to 5,000/N, where N denotes the original vertex count. The minimum vertex number in a cube Nmin in surface compression is set to 32. Local perspective modeling utilizes three layers, and global landscape modeling employs a two-layer FAMHA. The two biochemical features are mapped to a hidden space with a dimensionality of 256. The autoregressive decoder is built with a 3-layer Transformer decoder. The mini-batch size and learning rate are set to 4,096 tokens and 5e-4, respectively. |
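The vertex cap quoted under Experiment Setup can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paper compresses large surfaces by grouping vertices into cubes (with a minimum of Nmin = 32 vertices per cube), whereas this sketch uses uniform random subsampling; the function name and the use of NumPy are assumptions.

```python
import numpy as np

MAX_VERTICES = 5000  # per-surface cap from the paper

def downsample_surface(vertices: np.ndarray, rng=None) -> np.ndarray:
    """Cap a surface at 5,000 vertices.

    Surfaces with at most 5,000 vertices are returned unchanged;
    larger surfaces are compressed with down-sampling ratio
    r = 5000 / N, where N is the original vertex count. Uniform
    random subsampling is an illustrative stand-in for the paper's
    cube-based compression.
    """
    rng = rng or np.random.default_rng(0)
    n = len(vertices)
    if n <= MAX_VERTICES:
        return vertices
    r = MAX_VERTICES / n  # the down-sampling ratio described in the paper
    keep = rng.choice(n, size=MAX_VERTICES, replace=False)  # n * r == 5000 vertices kept
    return vertices[np.sort(keep)]
```

A surface of 12,000 vertices would be reduced to 5,000 (r ≈ 0.42), while a 3,000-vertex surface passes through untouched.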
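The split rule quoted under Dataset Splits (8:1:1 for binder categories with over 50 complexes, otherwise everything to the test set as a zero-shot scenario) can be sketched like this. The function name, threshold parameter, and seeding are illustrative assumptions; the paper's enzyme data additionally uses a 100-sample threshold and clustering before splitting, which this sketch omits.

```python
import random

def split_category(complexes, threshold=50, seed=0):
    """Sketch of the per-category split rule for the binder data.

    Categories with more than `threshold` complexes get an 8:1:1
    random split into train/validation/test; smaller categories go
    entirely to the test set (zero-shot evaluation).
    """
    items = list(complexes)
    if len(items) <= threshold:
        return [], [], items  # zero-shot: no training data for this category
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

For a category of 100 complexes this yields 80/10/10; a category of 30 complexes is placed entirely in the test split.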