Self-supervised Pocket Pretraining via Protein Fragment-Surroundings Alignment
Authors: Bowen Gao, Yinjun Jia, Yuanle Mo, Yuyan Ni, Wei-Ying Ma, Zhi-Ming Ma, Yanyan Lan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method, named ProFSA, achieves state-of-the-art performance across various tasks, including pocket druggability prediction, pocket matching, and ligand binding affinity prediction. 4 EXPERIMENTS To evaluate the performance of our proposed model, we conduct extensive experiments mainly from three perspectives: 1. Pocket-only tasks including pocket druggability prediction and pocket matching in 4.1 and 4.2; 2. The pocket-molecule task of ligand binding affinity prediction in 4.3; 3. Ablation studies in 4.4 illustrating the impact of diverse data scales, molecular encoders, and data distributions. |
| Researcher Affiliation | Academia | Bowen Gao1, Yinjun Jia2, Yuanle Mo3, Yuyan Ni4, Weiying Ma1, Zhiming Ma4, Yanyan Lan1,5. 1: Institute for AI Industry Research, Tsinghua University; 2: School of Life Sciences, IDG/McGovern Institute for Brain Research, Tsinghua University; 3: School of Information and Software Engineering, UESTC; 4: Academy of Mathematics and Systems Science, Chinese Academy of Sciences; 5: Beijing Frontier Research Center for Biological Structure, Tsinghua University |
| Pseudocode | Yes | A simplified algorithm is presented in Algorithm 1 to illustrate key steps. Algorithm 1 The Construction of Pseudo-Ligand-Pocket Complexes |
| Open Source Code | Yes | The code and data is available at https://github.com/bowen-gao/ProFSA. |
| Open Datasets | Yes | Currently available protein pocket data are all collected from the Protein Data Bank (PDB) (Berman et al., 2000). The most famous database is the PDBBind (Liu et al., 2015; Wang et al., 2005; 2004), which consists of 19,443 protein-ligand pairs in the latest version (v2020). Biolip2 (Zhang et al., 2023) is one of the most comprehensive ones, which includes 467,808 pocket-ligand pairs... The Kahraman dataset (Kahraman et al., 2010; Ehrt et al., 2018)... TOUGH-M1 dataset (Govindaraj & Brylinski, 2018)... |
| Dataset Splits | Yes | As for the split based on a 30% sequence identity threshold, the resulting sizes of training, validation, and test sets are 3507, 466, and 490, respectively, while the 60% sequence identity threshold leads to counterparts of size 3678, 460, and 460, respectively. |
| Hardware Specification | Yes | During the pretraining phase, we utilize a batch size of 4 × 48 on 4 Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer' and 'Python', and the 'FreeSASA package in Python', but does not provide specific version numbers for any key software dependencies or libraries. |
| Experiment Setup | Yes | During the pretraining phase, we utilize a batch size of 4 × 48 on 4 Nvidia A100 GPUs. We choose the Adam optimizer with a learning rate of 1 × 10−4 and cap the training at 100 epochs. A polynomial decay scheduler with a warmup ratio of 0.06 is implemented. The checkpoint yielding the best validation AUC is retained, complemented by an early stopping strategy set with a 20-epoch patience. |
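The Experiment Setup row above reports a polynomial-decay learning-rate schedule with a 0.06 warmup ratio and early stopping on validation AUC with 20-epoch patience. A minimal, framework-free sketch of those two mechanisms follows; the function and class names, the `power` exponent, and the `end_lr` floor are assumptions for illustration, since the paper reports only the base rate (1e-4) and the warmup ratio.

```python
def poly_decay_lr(step, total_steps, base_lr=1e-4, warmup_ratio=0.06,
                  power=1.0, end_lr=0.0):
    """Linear warmup followed by polynomial decay (assumed schedule form)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay polynomially from base_lr down to end_lr over the rest.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return end_lr + (base_lr - end_lr) * (1.0 - progress) ** power


class EarlyStopper:
    """Stop training once validation AUC stalls for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_auc):
        if val_auc > self.best:
            self.best = val_auc       # new best checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True means stop training
```

In a training loop, `poly_decay_lr(step, total_steps)` would be passed to the optimizer each step, and `EarlyStopper(patience=20).step(val_auc)` would be checked once per epoch, mirroring the "best validation AUC checkpoint plus 20-epoch patience" strategy described in the table.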