Multi-level Protein Structure Pre-training via Prompt Learning
Authors: Zeyuan Wang, Qiang Zhang, Shuang-Wei Hu, Haoran Yu, Xurui Jin, Zhichen Gong, Huajun Chen
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on function prediction and protein engineering show that PromptProtein outperforms state-of-the-art methods by large margins. |
| Researcher Affiliation | Collaboration | Zeyuan Wang (1,2,7), Qiang Zhang (1,2), Haoran Yu (2,3), Shuangwei Hu (4), Xurui Jin (5), Zhichen Gong (2,6), Huajun Chen (1,2,7,8). Affiliations: 1: College of Computer Science and Technology, Zhejiang University; 2: ZJU-Hangzhou Global Scientific and Technological Innovation Center; 3: College of Chemical and Biological Engineering, Zhejiang University; 4: VecX Biomedicines Inc.; 5: MindRank AI Ltd.; 6: University College London; 7: AZFT Joint Lab for Knowledge Engine; 8: East China Sea Laboratory |
| Pseudocode | Yes | For the sake of understanding, we provide the pseudo-code of the prompt-guided multi-task pre-training and fine-tuning framework in Appendix A.3. (Referring to 'Algorithm 1: Prompt-Guided Multi-Task Pre-Training' and 'Algorithm 2: Prompt-Guided Fine-tuning' in Appendix A.3; a hedged sketch of such a training loop appears below the table.) |
| Open Source Code | No | The source code will be available online. (A promise of future availability, not a concrete, currently accessible repository.) |
| Open Datasets | Yes | For the primary structural information, we use UniRef50 (Suzek et al., 2015), which is a clustering of UniRef90 seed sequences at 50% sequence identity. [...] For the secondary and tertiary structural information, we use Protein Data Bank (PDB) (Berman et al., 2000). [...] For the quaternary structure information, we use the STRING dataset (Szklarczyk et al., 2019). |
| Dataset Splits | Yes | 10% of UniRef50 clusters are randomly selected as a held-out evaluation set. [...] We follow the dataset split method in (Gligorijević et al., 2021). [...] Table 4 ("Statistics of the downstream datasets") lists #TRAIN, #VALIDATION, and #TEST counts per dataset. (A minimal cluster-level split sketch appears below the table.) |
| Hardware Specification | Yes | All models are trained on 2 A100 40G GPUs for 270k steps of updates. |
| Software Dependencies | No | The paper states 'We implement PromptProtein using Pytorch (Paszke et al., 2019) and Fairseq (Ott et al., 2019).' It names the software but does not provide specific version numbers. |
| Experiment Setup | Yes | PromptProtein has 650M parameters with 33 layers and 20 attention heads. The embedding size is 1280. The learning rate is 1e-4 with no weight decay. We use an inverse square root learning rate schedule. All models are trained on 2 A100 40G GPUs for 270k steps of updates. (An illustrative implementation of this schedule appears below the table.) |
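
The actual pseudo-code lives in Appendix A.3 of the paper and the authors' implementation is not yet public. The following is a minimal, hypothetical PyTorch sketch of a prompt-guided multi-task pre-training step: `PromptGuidedModel`, `pretrain_step`, the per-task linear heads, and the loss plumbing are illustrative assumptions, not the paper's code; only the idea of prepending a learnable, task-specific prompt token and summing per-task losses follows Algorithm 1.

```python
import torch
import torch.nn as nn

class PromptGuidedModel(nn.Module):
    """Encoder with one learnable prompt token per pre-training task (sketch)."""
    def __init__(self, encoder, vocab_size, num_tasks, embed_dim=1280):
        super().__init__()
        self.encoder = encoder  # any module mapping (B, L, D) -> (B, L, D)
        self.prompts = nn.Embedding(num_tasks, embed_dim)  # learnable task prompts
        # One output head per task; real objectives (MLM, structure) would differ.
        self.heads = nn.ModuleList(
            nn.Linear(embed_dim, vocab_size) for _ in range(num_tasks)
        )

    def forward(self, token_embeds, task_id):
        # Prepend the task-specific prompt embedding to the token embeddings.
        batch = token_embeds.size(0)
        prompt = self.prompts.weight[task_id].view(1, 1, -1).expand(batch, -1, -1)
        hidden = self.encoder(torch.cat([prompt, token_embeds], dim=1))
        return self.heads[task_id](hidden[:, 1:])  # drop the prompt position

def pretrain_step(model, task_batches, optimizer):
    """One update summing losses over tasks; assumes one (inputs, targets,
    loss_fn) batch per task, in task-id order."""
    optimizer.zero_grad()
    loss = 0.0
    for task_id, (inputs, targets, loss_fn) in enumerate(task_batches):
        logits = model(inputs, task_id)
        loss = loss + loss_fn(logits, targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```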
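
The held-out evaluation set is drawn at the cluster level ("10% of UniRef50 clusters"), which keeps near-duplicate sequences from leaking across the split. A minimal sketch, assuming `clusters` maps a cluster ID to its member sequences (an assumed format; the paper does not specify one):

```python
import random

def split_clusters(clusters, eval_fraction=0.1, seed=0):
    """Split at the cluster level so no cluster spans both sets."""
    ids = sorted(clusters)          # deterministic base order
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_eval = int(len(ids) * eval_fraction)
    eval_seqs = [seq for cid in ids[:n_eval] for seq in clusters[cid]]
    train_seqs = [seq for cid in ids[n_eval:] for seq in clusters[cid]]
    return train_seqs, eval_seqs
```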
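
The setup names an inverse square root learning rate schedule with a peak learning rate of 1e-4. A common form of this schedule (as in Fairseq's `inverse_sqrt` scheduler) warms up linearly and then decays proportionally to 1/sqrt(step); the warmup length below is an assumption, since the paper does not report one:

```python
import math

def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_steps=4000):
    """Linear warmup to peak_lr, then 1/sqrt(step) decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps         # linear warmup
    return peak_lr * math.sqrt(warmup_steps / step)  # inverse-sqrt decay
```

At `step == warmup_steps` the two branches meet at `peak_lr`, so the schedule is continuous.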