Self-supervised Pocket Pretraining via Protein Fragment-Surroundings Alignment

Authors: Bowen Gao, Yinjun Jia, YuanLe Mo, Yuyan Ni, Wei-Ying Ma, Zhi-Ming Ma, Yanyan Lan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method, named ProFSA, achieves state-of-the-art performance across various tasks, including pocket druggability prediction, pocket matching, and ligand binding affinity prediction. ... 4 EXPERIMENTS: To evaluate the performance of our proposed model, we conduct extensive experiments mainly from three perspectives: 1. pocket-only tasks, including pocket druggability prediction and pocket matching, in 4.1 and 4.2; 2. the pocket-molecule task of ligand binding affinity prediction in 4.3; 3. ablation studies in 4.4 illustrating the impact of diverse data scales, molecular encoders, and data distributions.
Researcher Affiliation | Academia | Bowen Gao1, Yinjun Jia2, YuanLe Mo3, Yuyan Ni4, Wei-Ying Ma1, Zhi-Ming Ma4, Yanyan Lan1,5. 1Institute for AI Industry Research, Tsinghua University; 2School of Life Sciences, IDG/McGovern Institute for Brain Research, Tsinghua University; 3School of Information and Software Engineering, UESTC; 4Academy of Mathematics and Systems Science, Chinese Academy of Sciences; 5Beijing Frontier Research Center for Biological Structure, Tsinghua University
Pseudocode | Yes | A simplified algorithm is presented in Algorithm 1 to illustrate key steps. Algorithm 1: The Construction of Pseudo-Ligand-Pocket Complexes.
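The report does not reproduce Algorithm 1 itself. As a rough illustration of the construction it names, the sketch below excises a contiguous protein fragment as a pseudo-ligand and collects nearby residues as its pocket; the function name, fragment length, and 6 Å cutoff are assumptions for illustration, not values taken from the paper.

```python
# Hypothetical sketch of pseudo-ligand-pocket construction (not the paper's
# Algorithm 1 verbatim). Fragment length and distance cutoff are assumptions.
import numpy as np

def build_pseudo_complex(residue_coords, start, frag_len=3, cutoff=6.0):
    """residue_coords: list of (N_i, 3) atom-coordinate arrays, one per residue.
    Excise residues [start, start + frag_len) as the pseudo-ligand and return
    (fragment residue indices, pocket residue indices within `cutoff` angstroms)."""
    frag_atoms = np.concatenate(residue_coords[start:start + frag_len])
    pocket = []
    for i, atoms in enumerate(residue_coords):
        if start <= i < start + frag_len:
            continue  # the fragment itself is the pseudo-ligand, not the pocket
        # minimum pairwise distance between this residue and the fragment
        d = np.linalg.norm(atoms[:, None, :] - frag_atoms[None, :, :], axis=-1)
        if d.min() < cutoff:
            pocket.append(i)
    return list(range(start, start + frag_len)), pocket
```

Fragment-pocket pairs produced this way can serve as pseudo-ligand-pocket complexes for contrastive pretraining, which is the role Algorithm 1 plays in the paper.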
Open Source Code | Yes | The code and data are available at https://github.com/bowen-gao/ProFSA.
Open Datasets | Yes | Currently available protein pocket data are all collected from the Protein Data Bank (PDB) (Berman et al., 2000). The best-known database is PDBbind (Liu et al., 2015; Wang et al., 2005; 2004), which consists of 19,443 protein-ligand pairs in the latest version (v2020). BioLiP2 (Zhang et al., 2023) is one of the most comprehensive, including 467,808 pocket-ligand pairs... The Kahraman dataset (Kahraman et al., 2010; Ehrt et al., 2018)... the TOUGH-M1 dataset (Govindaraj & Brylinski, 2018)...
Dataset Splits | Yes | For the split based on a 30% sequence identity threshold, the resulting training, validation, and test sets contain 3507, 466, and 490 entries, respectively, while the 60% sequence identity threshold yields sets of 3678, 460, and 460.
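For context on how identity-threshold splits are typically enforced, here is a minimal sketch that assigns whole sequence clusters to a single split so that near-identical sequences cannot leak across train/validation/test. The clustering step (e.g., MMseqs2 run at the chosen identity threshold) is assumed to have produced `cluster_of`; the tool choice, function names, and split fractions are illustrative assumptions, not details from the paper.

```python
# Hypothetical identity-aware splitter. Assumes `cluster_of` maps each entry ID
# to a sequence-cluster ID computed at the identity threshold (e.g., 30%).
import random
from collections import defaultdict

def identity_split(ids, cluster_of, frac=(0.8, 0.1, 0.1), seed=0):
    clusters = defaultdict(list)
    for entry in ids:
        clusters[cluster_of[entry]].append(entry)
    groups = list(clusters.values())
    random.Random(seed).shuffle(groups)
    train, val, test, n = [], [], [], len(ids)
    for group in groups:
        # whole clusters go to a single split, so sequences above the identity
        # threshold can never sit on both sides of a train/test boundary
        if len(train) < frac[0] * n:
            train.extend(group)
        elif len(val) < frac[1] * n:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test
```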
Hardware Specification | Yes | During the pretraining phase, we utilize a batch size of 4 × 48 on 4 Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, Python, and the FreeSASA package in Python, but does not provide version numbers for any key software dependencies or libraries.
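For reference, the FreeSASA Python package mentioned above is typically invoked as follows; this is a minimal usage sketch and the input file name is hypothetical, since the paper pins no versions and shows no code.

```python
# Minimal FreeSASA usage sketch (illustrative; `pocket.pdb` is hypothetical).
import freesasa

structure = freesasa.Structure("pocket.pdb")  # load atoms from a PDB file
result = freesasa.calc(structure)             # compute solvent-accessible surface area
print(f"total SASA: {result.totalArea():.1f} Å²")
```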
Experiment Setup | Yes | During the pretraining phase, we utilize a batch size of 4 × 48 on 4 Nvidia A100 GPUs. We choose the Adam optimizer with a learning rate of 1 × 10^-4 and cap the training at 100 epochs. A polynomial decay scheduler with a warmup ratio of 0.06 is implemented. The checkpoint yielding the best validation AUC is retained, complemented by an early stopping strategy with a 20-epoch patience.
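Putting the reported hyperparameters together, a minimal PyTorch sketch of this recipe might look as follows. The model, data, and AUC evaluator are dummy stand-ins, and the polynomial power is an assumption (the paper states only the warmup ratio); this sketches the described schedule, not the authors' training code.

```python
# Sketch of the reported recipe: Adam at 1e-4, polynomial decay with a 0.06
# warmup ratio, at most 100 epochs, early stopping with 20-epoch patience on
# validation AUC. Model/data/metric below are dummies; power=1.0 is assumed.
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                                   # dummy model
train_loader = [torch.randn(48, 16) for _ in range(10)]    # dummy batches

def evaluate_auc(m):
    return torch.rand(1).item()                            # dummy validation AUC

def make_poly_scheduler(optimizer, total_steps, warmup_ratio=0.06, power=1.0):
    warmup = int(total_steps * warmup_ratio)
    def lr_lambda(step):
        if step < warmup:
            return step / max(1, warmup)                   # linear warmup
        progress = (step - warmup) / max(1, total_steps - warmup)
        return max(0.0, 1.0 - progress) ** power           # polynomial decay
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = make_poly_scheduler(optimizer, total_steps=100 * len(train_loader))

best_auc, patience, bad_epochs = 0.0, 20, 0
for epoch in range(100):
    for batch in train_loader:
        loss = model(batch).pow(2).mean()                  # dummy loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
    auc = evaluate_auc(model)
    if auc > best_auc:
        best_auc, bad_epochs = auc, 0
        torch.save(model.state_dict(), "best.pt")          # keep best-AUC checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                         # early stopping
            break
```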