SaProt: Protein Language Modeling with Structure-aware Vocabulary

Authors: Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, Fajie Yuan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive evaluation, our SaProt model surpasses well-established and renowned baselines across 10 significant downstream tasks, demonstrating its exceptional capacity and broad applicability. We evaluate SaProt across 10 diverse downstream tasks, encompassing residue-level and protein-level tasks. We conduct a series of enlightening ablation studies, unveiling previously unknown findings.
Researcher Affiliation | Academia | Jin Su¹,², Chenchen Han², Yuyang Zhou², Junjie Shan², Xibin Zhou², Fajie Yuan² (Zhejiang University¹, Westlake University²)
Pseudocode | No | The paper describes algorithms and methods but does not provide pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | We have made the code, pretrained model, and all relevant materials available at https://github.com/westlake-repl/SaProt.
Open Datasets | Yes | We adopt the ProteinGym (Notin et al., 2022) benchmark and ClinVar (Landrum et al., 2018) dataset... We download all AF2 structures based on UniProt IDs. After the release of the AlphaFold DB (Varadi et al., 2021), the majority of predicted structures are now accessible through searching UniProt IDs. (A download sketch follows the table.)
Dataset Splits | Yes | With the exception of the Metal Ion Binding and DeepLoc tasks, we utilize the official data split in the related benchmark literature (TAPE (Rao et al., 2019), PEER (Xu et al., 2022) and FLIP (Dallago et al., 2021)), which includes separate training, validation, and testing sets.
Hardware Specification | Yes | Its training lasted 3 months and utilized 64 NVIDIA 80G A100 GPUs.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify versions for any ancillary software such as Python, PyTorch, or specific libraries.
Experiment Setup | Yes | Specifically, we employ the AdamW optimizer (Loshchilov & Hutter, 2017), setting β1 = 0.9, β2 = 0.98 and we utilized L2 weight decay of 0.01. We gradually increase the learning rate from 0 to 4e-4 over the first 2000 steps and linearly lower it to 5e-4 from 150K steps to 1.5M steps. The overall training phase lasts approximately 3M steps. To deal with long sequences, we truncate them to a maximum of 1024 tokens, and our batch size consists of 512 sequences. Additionally, we also employ mixed precision training to train SaProt. (A configuration sketch follows the table.)
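
The Open Datasets row above notes that the AF2 structures were retrieved from the AlphaFold DB by UniProt accession. The sketch below shows one minimal way to do that; it is not taken from the SaProt repository, and the file-naming pattern (model version "v4") reflects the public AlphaFold DB layout, which may change.

```python
# Minimal sketch (not from the SaProt codebase): download an AlphaFold2-predicted
# structure from the AlphaFold DB by UniProt accession. The URL pattern and the
# "v4" model version follow the public AlphaFold DB file layout at time of writing.
import urllib.request

AFDB_URL = "https://alphafold.ebi.ac.uk/files/AF-{uid}-F1-model_v4.pdb"

def download_af2_structure(uniprot_id: str, out_path: str) -> None:
    """Fetch the predicted structure for a single UniProt accession."""
    urllib.request.urlretrieve(AFDB_URL.format(uid=uniprot_id), out_path)

if __name__ == "__main__":
    # Any valid UniProt accession works here; P68871 (human hemoglobin subunit
    # beta) is used purely as an example.
    download_af2_structure("P68871", "AF-P68871-F1-model_v4.pdb")
```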
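
The Experiment Setup row quotes the optimizer and learning-rate schedule used for pretraining. Below is a hedged PyTorch sketch of that configuration; it is illustrative rather than the authors' script, `model` is a stand-in for the SaProt network, and `FINAL_LR` is an assumed placeholder for the decay endpoint reported in the paper.

```python
# Illustrative sketch only (not the authors' training code): AdamW with a linear
# warmup followed by a linear decay, matching the quoted hyperparameters where
# they are given. FINAL_LR is an assumption, not a value from the paper quote.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

PEAK_LR = 4e-4          # reached at the end of warmup
WARMUP_STEPS = 2_000    # linear warmup from 0
DECAY_START = 150_000   # linear decay begins here
DECAY_END = 1_500_000   # ...and ends here
FINAL_LR = 4e-5         # placeholder: substitute the endpoint reported in the paper

def lr_lambda(step: int) -> float:
    """Multiplicative factor applied to PEAK_LR at each optimizer step."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    if step < DECAY_START:
        return 1.0
    if step < DECAY_END:
        frac = (step - DECAY_START) / (DECAY_END - DECAY_START)
        return 1.0 + frac * (FINAL_LR / PEAK_LR - 1.0)
    return FINAL_LR / PEAK_LR

model = torch.nn.Linear(8, 8)  # stand-in for the SaProt model
optimizer = AdamW(model.parameters(), lr=PEAK_LR,
                  betas=(0.9, 0.98), weight_decay=0.01)
scheduler = LambdaLR(optimizer, lr_lambda)
# Mixed-precision training would additionally wrap the forward/backward pass in
# torch.cuda.amp.autocast() with a GradScaler; per the quoted setup, sequences
# are truncated to 1024 tokens and batched 512 at a time.
```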