OntoProtein: Protein Pretraining With Gene Ontology Embedding
Authors: Ningyu Zhang, Zhen Bi, Xiaozhuan Liang, Siyuan Cheng, Haosen Hong, Shumin Deng, Qiang Zhang, Jiazhang Lian, Huajun Chen
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that OntoProtein can surpass state-of-the-art methods with pre-trained protein language models in TAPE benchmark and yield better performance compared with baselines in protein-protein interaction and protein function prediction. [...] We conduct extensive experiments in widespread protein tasks, including TAPE benchmark, protein-protein interaction prediction, and protein function prediction, which demonstrate the effectiveness of our proposed approach. |
| Researcher Affiliation | Academia | 1College of Computer Science and Technology, Zhejiang University 2School of Software Technology, Zhejiang University 3Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies 4Hangzhou Innovation Center, Zhejiang University {zhangningyu,bizhen zju,liangxiaozhuan,22151070}@zju.edu.cn {231sm,12028071,jzlian,qiang.zhang.cs,huajunsir}@zju.edu.cn |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and datasets are available in https://github.com/zjunlp/OntoProtein. |
| Open Datasets | Yes | Pre-training Dataset To incorporate Gene Ontology knowledge into language models, we build a new pre-training dataset called ProteinKG25, which is a large-scale KG dataset with aligned descriptions and protein sequences respectively to GO terms and protein entities. [...] Our code and datasets are all available in the https://github.com/zjunlp/OntoProtein for reproducibility. |
| Dataset Splits | Yes | We deliver data splits for both the inductive and the transductive settings to promote future research. [...] We design two evaluation schemes, the transductive and the inductive settings, which simulate two scenarios of gene annotation in reality. [...] Table 6: Hyper-parameters for the downstream task. |
| Hardware Specification | Yes | We utilize Pytorch (Paszke et al. (2019)) to conduct experiments with Nvidia V100 GPUs. |
| Software Dependencies | No | We utilize Pytorch (Paszke et al. (2019)) to conduct experiments with Nvidia V100 GPUs. |
| Experiment Setup | Yes | This section details the training procedures and hyperparameters for each of the datasets. We utilize Pytorch (Paszke et al. (2019)) to conduct experiments with Nvidia V100 GPUs. In pre-training of OntoProtein, similar to Elnaggar et al. (2020), we use the same training protocol such as optimizer, learning rate schedule on BERT model. We set γ to 12.0 and the number of negative sampling to 128 in Equation 1. [...] Table 6: Hyper-parameters for the downstream task. |