Self-Supervised Pre-training for Protein Embeddings Using Tertiary Structures
Authors: Yuzhi Guo, Jiaxiang Wu, Hehuan Ma, Junzhou Huang (pp. 6801-6809)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our pre-training model on two downstream tasks, protein structure quality assessment (QA) and protein-protein interaction (PPI) site prediction. Hierarchical structure embeddings are extracted to enhance corresponding prediction models. Extensive experiments indicate that such structure embeddings consistently improve the prediction accuracy for both downstream tasks. |
| Researcher Affiliation | Collaboration | 1University of Texas at Arlington, Arlington, TX, 76019, USA 2Tencent AI Lab, Shenzhen, 518057, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | For the pre-training model, we obtain native protein structures from the RCSB-PDB database (released on 01/05/2021) (Berman et al. 2000), which includes over 170 thousand unlabeled protein tertiary structures. For the protein QA prediction task, we use the dataset published by Graph QA (Baldassarre et al. 2021). For the PPI site prediction task, we use the processed data from Deep PPISP (Zeng et al. 2020), i.e. Dset 186 of 186 proteins, Dset 72 of 72 proteins (Murakami and Mizuguchi 2010), and PDBset 164 of 164 proteins (Singh et al. 2014). |
| Dataset Splits | Yes | After removing proteins that overlap with the validation and test data of the downstream tasks, the BC100 dataset contains 73,585 proteins, among which 58,868 are used as the training set, 7,357 as the validation set, and the remaining ones as the test set. The BC-30 dataset consists of 29,242 proteins; of these, 23,394 are used as the training set, 2,923 as the validation set, and 2,925 for testing. For the protein QA prediction task, we use the dataset published by Graph QA (Baldassarre et al. 2021). The CASP9-CASP12 datasets contain 85k decoys, which are randomly split into a training set (270 targets) and a validation set (50 targets). For the PPI site prediction task, there are 300 proteins in the training set, 50 proteins in the independent validation set, and 70 proteins in the test set. |
| Hardware Specification | No | The paper vaguely mentions 'high-performance GPU clusters' but does not provide specific details on the hardware used for experiments, such as exact GPU or CPU models, or memory specifications. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | The number of hidden-layer channels k is set to 64. We use a batch size of 32 for training and validation, and randomly crop the input feature maps to size 32 for data augmentation. The positional encoding dimension is set to dmodel = 24. We construct random-noise standard deviations for K = 32 levels, ranging from 0.01 to 10.0. For the optimization, we apply a constant learning rate of 0.0001 and use Adam (Kingma and Ba 2014) as the optimizer for our pre-training model. After training 50 epochs, we select the optimal checkpoint based on the validation loss. |
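The experiment-setup row above lists the paper's pre-training hyperparameters. The sketch below collects them into a runnable configuration, assuming a geometric spacing for the K = 32 noise standard deviations (the paper gives only the range 0.01 to 10.0, not the spacing; geometric progressions are a common choice in noise-level schedules, so this is an assumption, not the authors' stated method).

```python
import numpy as np

# Assumed geometric schedule for the K = 32 noise standard deviations.
# The paper states only the range (0.01 to 10.0); the spacing is a guess.
K = 32
sigmas = np.geomspace(0.01, 10.0, num=K)

# Hyperparameters quoted directly from the paper's experiment setup.
config = {
    "hidden_channels": 64,    # k, hidden-layer channels
    "batch_size": 32,         # training and validation
    "crop_size": 32,          # random crop of input feature maps (augmentation)
    "d_model": 24,            # positional encoding dimension
    "num_noise_levels": K,
    "learning_rate": 1e-4,    # constant, Adam optimizer
    "epochs": 50,             # best checkpoint chosen by validation loss
}

print(len(sigmas), round(float(sigmas[0]), 4), round(float(sigmas[-1]), 4))
```

With `np.geomspace` the endpoints are hit exactly, so the schedule spans 0.01 through 10.0 across all 32 levels; swapping in `np.linspace` would give the same range with uniform spacing instead.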