Multi-level Protein Structure Pre-training via Prompt Learning
Authors: Zeyuan Wang, Qiang Zhang, Shuang-Wei Hu, Haoran Yu, Xurui Jin, Zhichen Gong, Huajun Chen
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on function prediction and protein engineering show that PromptProtein outperforms state-of-the-art methods by large margins. |
| Researcher Affiliation | Collaboration | Zeyuan Wang (1,2,7), Qiang Zhang (1,2), Haoran Yu (2,3), Shuangwei Hu (4), Xurui Jin (5), Zhichen Gong (2,6), Huajun Chen (1,2,7,8). Affiliations: 1: College of Computer Science and Technology, Zhejiang University; 2: ZJU-Hangzhou Global Scientific and Technological Innovation Center; 3: College of Chemical and Biological Engineering, Zhejiang University; 4: VecX Biomedicines Inc.; 5: MindRank AI Ltd.; 6: University College London; 7: AZFT Joint Lab for Knowledge Engine; 8: East China Sea Laboratory |
| Pseudocode | Yes | For the sake of understanding, we provide the pseudo-code of the prompt-guided multi-task pre-training and fine-tuning framework in Appendix A.3. (Referring to 'Algorithm 1: Prompt-Guided Multi-Task Pre-Training' and 'Algorithm 2: Prompt-Guided Fine-tuning' in Appendix A.3; a hedged sketch of such a training loop appears below the table.) |
| Open Source Code | No | The source code will be available online. (A promise of future availability, not a concrete, currently accessible repository.) |
| Open Datasets | Yes | For the primary structural information, we use UniRef50 (Suzek et al., 2015), which is a clustering of UniRef90 seed sequences at 50% sequence identity. [...] For the secondary and tertiary structural information, we use Protein Data Bank (PDB) (Berman et al., 2000). [...] For the quaternary structure information, we use the STRING dataset (Szklarczyk et al., 2019). |
| Dataset Splits | Yes | 10% of UniRef50 clusters are randomly selected as a held-out evaluation set. [...] We follow the dataset split method in (Gligorijević et al., 2021). [...] Table 4 ("Statistics of the downstream datasets") lists #TRAIN, #VALIDATION, and #TEST counts per dataset. (A minimal cluster-level split sketch appears below the table.) |
| Hardware Specification | Yes | All models are trained on 2 A100 40G GPUs for 270k steps of updates. |
| Software Dependencies | No | The paper states 'We implement PromptProtein using Pytorch (Paszke et al., 2019) and Fairseq (Ott et al., 2019).' It names the software but does not provide specific version numbers. |
| Experiment Setup | Yes | PromptProtein has 650M parameters with 33 layers and 20 attention heads. The embedding size is 1280. The learning rate is 1e-4 with no weight decay. We use an inverse square root learning rate schedule. All models are trained on 2 A100 40G GPUs for 270k steps of updates. (An illustrative implementation of this schedule appears below the table.) |
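
The actual pseudo-code lives in Appendix A.3 of the paper and the authors' implementation is not yet public. The following is a minimal, hypothetical PyTorch sketch of a prompt-guided multi-task pre-training step: `PromptGuidedModel`, `pretrain_step`, the per-task linear heads, and the loss plumbing are illustrative assumptions, not the paper's code; only the idea of prepending a learnable, task-specific prompt token and summing per-task losses follows Algorithm 1.

```python
import torch
import torch.nn as nn

class PromptGuidedModel(nn.Module):
    """Encoder with one learnable prompt token per pre-training task (sketch)."""
    def __init__(self, encoder, vocab_size, num_tasks, embed_dim=1280):
        super().__init__()
        self.encoder = encoder  # any module mapping (B, L, D) -> (B, L, D)
        self.prompts = nn.Embedding(num_tasks, embed_dim)  # learnable task prompts
        # One output head per task; real objectives (MLM, structure) would differ.
        self.heads = nn.ModuleList(
            nn.Linear(embed_dim, vocab_size) for _ in range(num_tasks)
        )

    def forward(self, token_embeds, task_id):
        # Prepend the task-specific prompt embedding to the token embeddings.
        batch = token_embeds.size(0)
        prompt = self.prompts.weight[task_id].view(1, 1, -1).expand(batch, -1, -1)
        hidden = self.encoder(torch.cat([prompt, token_embeds], dim=1))
        return self.heads[task_id](hidden[:, 1:])  # drop the prompt position

def pretrain_step(model, task_batches, optimizer):
    """One update summing losses over tasks; assumes one (inputs, targets,
    loss_fn) batch per task, in task-id order."""
    optimizer.zero_grad()
    loss = 0.0
    for task_id, (inputs, targets, loss_fn) in enumerate(task_batches):
        logits = model(inputs, task_id)
        loss = loss + loss_fn(logits, targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```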
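
The held-out evaluation set is drawn at the cluster level ("10% of UniRef50 clusters"), which keeps near-duplicate sequences from leaking across the split. A minimal sketch, assuming `clusters` maps a cluster ID to its member sequences (an assumed format; the paper does not specify one):

```python
import random

def split_clusters(clusters, eval_fraction=0.1, seed=0):
    """Split at the cluster level so no cluster spans both sets."""
    ids = sorted(clusters)          # deterministic base order
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_eval = int(len(ids) * eval_fraction)
    eval_seqs = [seq for cid in ids[:n_eval] for seq in clusters[cid]]
    train_seqs = [seq for cid in ids[n_eval:] for seq in clusters[cid]]
    return train_seqs, eval_seqs
```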
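
The setup names an inverse square root learning rate schedule with a peak learning rate of 1e-4. A common form of this schedule (as in Fairseq's `inverse_sqrt` scheduler) warms up linearly and then decays proportionally to 1/sqrt(step); the warmup length below is an assumption, since the paper does not report one:

```python
import math

def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_steps=4000):
    """Linear warmup to peak_lr, then 1/sqrt(step) decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps         # linear warmup
    return peak_lr * math.sqrt(warmup_steps / step)  # inverse-sqrt decay
```

At `step == warmup_steps` the two branches meet at `peak_lr`, so the schedule is continuous.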