Protein Representation Learning via Knowledge Enhanced Primary Structure Reasoning

Authors: Hong-Yu Zhou, Yunxiang Fu, Zhicheng Zhang, Cheng Bian, Yizhou Yu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we extensively evaluate the generalization ability of the learned protein representation by fine-tuning the pre-trained model on a wide range of downstream applications, including amino acid contact prediction, protein homology detection, protein stability prediction, protein-protein interaction identification, protein-protein binding affinity prediction, and semantic similarity inference. Besides, we also provide ablations and failure analyses to facilitate the understanding of KeAP. Unless otherwise specified, we follow the pre-training and fine-tuning protocols used by OntoProtein (refer to the appendix for more details), such as training strategies and dataset splits. The pre-trained models of ProtBert (Elnaggar et al., 2021), OntoProtein (Zhang et al., 2022), and our KeAP share the same number of network parameters. Average results are reported over three independent training runs.
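The averaging protocol quoted above is simple to express in code. Below is a minimal sketch of reporting the mean and standard deviation over three independent fine-tuning runs; `finetune_and_evaluate` is a hypothetical helper standing in for the task-specific fine-tuning pipeline, not a function from the KeAP release.

```python
# Sketch: average a downstream metric over three independent training runs,
# as described in the quoted evaluation protocol. The callable passed in is
# assumed to fine-tune the pre-trained model under a given seed and return
# a scalar score for the task at hand.
import statistics

def average_over_runs(finetune_and_evaluate, seeds=(0, 1, 2)):
    """Run fine-tuning once per seed; return (mean, stdev) of the scores."""
    scores = [finetune_and_evaluate(seed=seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```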
Researcher Affiliation | Collaboration | 1 Department of Computer Science, The University of Hong Kong; 2 Xiaohe Healthcare, ByteDance; 3 Jancsi Tech; 4 OPPO Health Lab
Pseudocode | No | The paper includes architectural diagrams (Figure 2) but no explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are available at https://github.com/RL4M/KeAP.
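Since KeAP builds on a ProtBert-style BERT encoder, a released checkpoint would typically be loadable through Hugging Face transformers. The sketch below assumes that interface; the local path is a placeholder for weights obtained from the repository above, and the exact loading entry point provided by the authors may differ.

```python
# Hedged example: load a ProtBert-style checkpoint and embed a sequence.
# "path/to/keap_checkpoint" is a placeholder, not a path from the repo.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("path/to/keap_checkpoint")

# ProtBert-style tokenizers expect space-separated amino acids.
sequence = "M K T A Y I A K Q R"
inputs = tokenizer(sequence, return_tensors="pt")
embeddings = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
```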
Open Datasets | Yes | ProteinKG25 (Zhang et al., 2022) provides a knowledge graph that consists of approximately five million triplets, with nearly 600k proteins, 50k attribute terms, and 31 relation terms included. ... The data are from Hou et al. (2018) and we report average accuracy on the fold-level held-out set. ... We evaluate model performance by calculating Spearman's rank correlation scores on the whole test set (Rocklin et al., 2017). ... We perform experiments on SHS27K (Chen et al., 2019), SHS148K (Chen et al., 2019), and STRING (Lv et al., 2021). ... We used the SKEMPI dataset from Moal & Fernández-Recio (2012) and report the mean square error of 10-fold cross-validation.
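To make the ProteinKG25 structure concrete, here is a minimal sketch of a (protein, relation, attribute) triplet record. The field names and example values are illustrative assumptions, not the released data format.

```python
# Sketch of one knowledge-graph triplet from ProteinKG25: a protein linked
# to a textual attribute term via one of the ~31 relation terms.
from typing import NamedTuple

class Triplet(NamedTuple):
    protein: str    # amino acid sequence
    relation: str   # one of ~31 relation terms
    attribute: str  # textual attribute term (one of ~50k)

example = Triplet(
    protein="MKTAYIAKQR",       # hypothetical sequence fragment
    relation="enables",          # hypothetical relation term
    attribute="ATP binding",     # hypothetical attribute term
)
```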
Dataset Splits | No | The paper mentions 10-fold cross-validation for one task (protein-protein binding affinity prediction) and provides hyper-parameters in Table 10 (Appendix A.1), but it does not specify explicit train/validation/test splits (e.g., percentages or counts) for the general experimental setup, so the data partitioning cannot be fully reproduced from the paper alone.
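The one partitioning scheme the paper does state, 10-fold cross-validation for binding affinity, is straightforward to reproduce. Below is a scikit-learn sketch; `features` and `affinities` are random stand-ins, and the simple ridge regressor is a placeholder for the actual fine-tuned encoder head.

```python
# Sketch: 10-fold cross-validation with mean squared error, mirroring the
# protocol stated for binding affinity prediction. Data here is synthetic.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 16))  # stand-in protein embeddings
affinities = rng.normal(size=100)      # stand-in binding affinities

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(features):
    model = Ridge().fit(features[train_idx], affinities[train_idx])
    errors.append(mean_squared_error(affinities[test_idx], model.predict(features[test_idx])))

print(f"10-fold mean squared error: {np.mean(errors):.3f}")
```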
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions a 'BERT-like architecture (Devlin et al., 2018)', 'PubMedBERT (Gu et al., 2021)', 'QKV Attention (Vaswani et al., 2017)', 'layer normalization (Ba et al., 2016)', a 'residual multi-layer perceptron (MLP)', and the 'AdamW' optimizer. However, it does not specify version numbers for any of these software components or libraries.
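The named components compose into a standard transformer block. The PyTorch sketch below shows one way they fit together (QKV attention, layer normalization, a residual MLP, AdamW); the dimensions, pre-norm ordering, and layer sizes are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of the building blocks cited above, assembled into a
# pre-norm transformer block with a residual MLP, trained with AdamW.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # QKV attention
        self.norm1 = nn.LayerNorm(dim)  # layer normalization (Ba et al., 2016)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(       # residual MLP branch
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]    # residual connection around attention
        x = x + self.mlp(self.norm2(x))  # residual connection around the MLP
        return x

block = TransformerBlock()
optimizer = torch.optim.AdamW(block.parameters(), lr=1e-5)
```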
Experiment Setup | Yes | The hyper-parameters for fine-tuning are provided in Table 10. Specifically, we follow the hyper-parameter settings in GNN-PPI (Lv et al., 2021) for PPI prediction. For protein binding affinity prediction and semantic similarity inference, we follow the fine-tuning configurations in PROBE (Unsal et al., 2022). Table 10 lists, for each task: epochs, batch size, warmup ratio, learning rate, whether BERT is frozen (Freeze Bert), and the optimizer.
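A per-task configuration mirroring the columns of Table 10 could be encoded as below. The numeric values are placeholders only; the actual settings must be taken from Table 10 in the appendix.

```python
# Sketch: one fine-tuning configuration record per task, matching the
# Table 10 columns. All values shown are placeholders, not the paper's.
finetune_config = {
    "contact_prediction": {
        "epochs": 5,            # Epoch
        "batch_size": 8,        # Batch size
        "warmup_ratio": 0.08,   # Warmup ratio
        "learning_rate": 3e-5,  # Learning rate
        "freeze_bert": False,   # Freeze Bert
        "optimizer": "AdamW",   # Optimizer
    },
}
```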