OntoProtein: Protein Pretraining With Gene Ontology Embedding

Authors: Ningyu Zhang, Zhen Bi, Xiaozhuan Liang, Siyuan Cheng, Haosen Hong, Shumin Deng, Qiang Zhang, Jiazhang Lian, Huajun Chen

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that OntoProtein can surpass state-of-the-art methods with pre-trained protein language models in TAPE benchmark and yield better performance compared with baselines in protein-protein interaction and protein function prediction. [...] We conduct extensive experiments in widespread protein tasks, including TAPE benchmark, protein-protein interaction prediction, and protein function prediction, which demonstrate the effectiveness of our proposed approach.
Researcher Affiliation | Academia | 1 College of Computer Science and Technology, Zhejiang University; 2 School of Software Technology, Zhejiang University; 3 Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies; 4 Hangzhou Innovation Center, Zhejiang University. Emails: {zhangningyu,bizhen.zju,liangxiaozhuan,22151070}@zju.edu.cn; {231sm,12028071,jzlian,qiang.zhang.cs,huajunsir}@zju.edu.cn
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and datasets are available at https://github.com/zjunlp/OntoProtein.
Open Datasets | Yes | Pre-training Dataset: To incorporate Gene Ontology knowledge into language models, we build a new pre-training dataset called ProteinKG25, which is a large-scale KG dataset with aligned descriptions and protein sequences respectively to GO terms and protein entities. [...] Our code and datasets are all available at https://github.com/zjunlp/OntoProtein for reproducibility. (See the triple-format sketch after this table.)
Dataset Splits | Yes | We deliver data splits for both the inductive and the transductive settings to promote future research. [...] We design two evaluation schemes, the transductive and the inductive settings, which simulate two scenarios of gene annotation in reality. [...] Table 6: Hyper-parameters for the downstream task. (See the inductive-split sketch after this table.)
Hardware Specification | Yes | We utilize PyTorch (Paszke et al. (2019)) to conduct experiments with Nvidia V100 GPUs.
Software Dependencies | No | We utilize PyTorch (Paszke et al. (2019)) to conduct experiments with Nvidia V100 GPUs.
Experiment Setup | Yes | This section details the training procedures and hyperparameters for each of the datasets. We utilize PyTorch (Paszke et al. (2019)) to conduct experiments with Nvidia V100 GPUs. In pre-training of OntoProtein, similar to Elnaggar et al. (2020), we use the same training protocol (e.g., optimizer and learning rate schedule) on the BERT model. We set γ to 12.0 and the number of negative samples to 128 in Equation 1. [...] Table 6: Hyper-parameters for the downstream task. (See the knowledge-embedding loss sketch after this table.)
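
The Open Datasets row describes ProteinKG25 as a knowledge graph that aligns protein sequences and GO-term descriptions to (protein, relation, GO term) triples. The following is a rough illustration of what one such record might contain; the field names, accession, and GO identifier are hypothetical examples, not the released schema:

```python
# Hypothetical layout of one ProteinKG25-style record; the real schema is
# defined by the files released at https://github.com/zjunlp/OntoProtein.
triple = {
    "head": "P05067",               # protein entity, e.g. a UniProt accession
    "relation": "enables",          # GO relation between protein and term
    "tail": "GO:0005515",           # Gene Ontology term identifier
    "head_seq": "MLPGLALLLLAAW",    # amino-acid sequence aligned to the protein
    "tail_desc": "protein binding"  # textual description aligned to the GO term
}
```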
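
The Dataset Splits row distinguishes transductive from inductive evaluation. In a transductive split, test triples may involve proteins already seen during training; in an inductive split, entire proteins are held out. The sketch below is a minimal illustration of one common way to build an inductive split, with assumed function and variable names, not the authors' released split script:

```python
import random

def inductive_split(triples, test_frac=0.1, seed=0):
    """Hold out whole proteins so no test protein appears in training.

    triples: list of (protein_id, relation, go_term) tuples.
    A transductive split would instead partition the triples directly,
    so test proteins could still occur in training triples.
    """
    rng = random.Random(seed)
    proteins = sorted({head for head, _, _ in triples})
    rng.shuffle(proteins)
    held_out = set(proteins[: int(len(proteins) * test_frac)])
    train = [t for t in triples if t[0] not in held_out]
    test = [t for t in triples if t[0] in held_out]
    return train, test
```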
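
The Experiment Setup row fixes the margin γ = 12.0 and 128 negative samples for the paper's Equation 1, which is not reproduced on this page. Margin-based knowledge-embedding objectives with negative sampling commonly take the form below; this is a minimal PyTorch sketch under that assumption, not the authors' implementation, and the paper's actual scoring function may differ:

```python
import torch.nn.functional as F

def ke_loss(pos_score, neg_scores, gamma=12.0):
    """Margin-based KE loss with negative sampling (RotatE-style form).

    pos_score:  (batch,)   distance d(h, r, t) for true triples
    neg_scores: (batch, k) distances for k corrupted triples (k = 128 in the paper)
    gamma:      margin (12.0 in the paper)
    """
    pos_loss = -F.logsigmoid(gamma - pos_score)            # true triples: distance below margin
    neg_loss = -F.logsigmoid(neg_scores - gamma).mean(-1)  # corrupted triples: distance above margin
    return (pos_loss + neg_loss).mean()
```

With TransE-style embeddings, pos_score could be computed as (h + r - t).norm(p=1, dim=-1); the choice of distance function here is an assumption, not a detail confirmed by the quoted text.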