ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
Authors: Minghao Xu, Xinyu Yuan, Santiago Miret, Jian Tang
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. ... We investigate the PLMs trained under ProtST by representation learning and zero-shot prediction. For representation learning, we verify their superior performance over previous masked language modeling and knowledge-enhanced PLMs on 11 standard benchmarks for protein localization prediction, fitness landscape prediction and protein function annotation (Sec. 4.2). |
| Researcher Affiliation | Collaboration | ¹Mila – Québec AI Institute ²Université de Montréal ³Intel Labs ⁴HEC Montréal ⁵CIFAR AI Research Chair. Correspondence to: Minghao Xu <minghao.xu@mila.quebec>, Santiago Miret <santiago.miret@intel.com>, Jian Tang <jian.tang@hec.ca>. |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. Methods are described in prose and diagrams. |
| Open Source Code | Yes | Source code and model weights are available at https://github.com/DeepGraphLearning/ProtST. |
| Open Datasets | Yes | To inject protein property information into PLMs, we build the ProtDescribe dataset with 553,052 aligned pairs of protein sequence and property description. Specifically, we employ the Swiss-Prot (Bairoch & Apweiler, 2000) database to provide annotations of various protein properties... |
| Dataset Splits | Yes | For all models on all tasks, we select the checkpoint for evaluation based on the validation set performance, and all results are reported on the seed 0. |
| Hardware Specification | Yes | An Adam optimizer (Kingma & Ba, 2014) (learning rate: 1.0 × 10⁻⁵, weight decay: 0) is used to train the whole model for 20 epochs on 4 Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using the "Adam optimizer" and "TorchDrug" but does not provide specific version numbers for these software components or other libraries used in the implementation. |
| Experiment Setup | Yes | An Adam optimizer (Kingma & Ba, 2014) (learning rate: 1.0 × 10⁻⁵, weight decay: 0) is used to train the whole model for 20 epochs on 4 Tesla V100 GPUs. ... ProtST-ProtBert adopts the batch size of 16 (4 proteins per GPU), and ProtST-ESM-1b and ProtST-ESM-2 adopt the batch size of 12 (3 proteins per GPU). ... We truncate the protein sequences that have more than 450 residues to the length of 450, where the truncation starts from a random residue before the last 450 ones. ... we initialize the temperature parameter τ in Eq. (1) as 0.07 and optimize it along the training process. (Illustrative sketches of the truncation and temperature/optimizer setup appear after this table.) |
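The random truncation quoted in the Experiment Setup row can be written as a short preprocessing helper. The sketch below is a minimal assumption-based illustration: the function name `truncate_sequence` and the uniform sampling of the crop start are hypothetical choices, not code from the ProtST repository, and the input may be any sequence type (string or list of residues).

```python
import random

def truncate_sequence(residues, max_length=450):
    """Randomly crop a protein sequence to at most `max_length` residues.

    The crop start is sampled uniformly over positions that still leave
    `max_length` residues available, matching the paper's description of
    truncation starting from a random residue before the last 450 ones.
    """
    if len(residues) <= max_length:
        return residues
    start = random.randint(0, len(residues) - max_length)
    return residues[start:start + max_length]
```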
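The learnable temperature τ (initialized to 0.07) and the reported Adam settings (learning rate 1.0 × 10⁻⁵, weight decay 0) can be sketched in PyTorch as follows. The `ContrastiveHead` class, the log-parameterization of τ, and the symmetric cross-entropy loss are assumptions about a typical CLIP-style contrastive setup, not the authors' exact implementation; the embedding dimension and batch size in the usage example are arbitrary, and a full run would optimize the protein and text encoders jointly with this head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveHead(nn.Module):
    """Hypothetical InfoNCE-style head with a learnable temperature tau."""

    def __init__(self, init_tau=0.07):
        super().__init__()
        # Log-parameterization keeps tau positive while it is optimized
        # jointly with the model (an implementation choice, not from the paper).
        self.log_tau = nn.Parameter(torch.log(torch.tensor(init_tau)))

    def forward(self, protein_emb, text_emb):
        # Cosine-similarity logits scaled by the learnable temperature.
        protein_emb = F.normalize(protein_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = protein_emb @ text_emb.t() / self.log_tau.exp()
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy over protein-to-text and text-to-protein.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    head = ContrastiveHead()
    # Dummy batch of 4 aligned (protein, text) embedding pairs, dimension 512.
    loss = head(torch.randn(4, 512), torch.randn(4, 512))
    # Adam with the hyperparameters reported in the paper.
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-5, weight_decay=0)
    loss.backward()
    optimizer.step()
```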