SaProt: Protein Language Modeling with Structure-aware Vocabulary

Authors: Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, Fajie Yuan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive evaluation, our SaProt model surpasses well-established and renowned baselines across 10 significant downstream tasks, demonstrating its exceptional capacity and broad applicability. We evaluate SaProt across 10 diverse downstream tasks, encompassing residue-level and protein-level tasks. We conduct a series of enlightening ablation studies, unveiling previously unknown findings.
Researcher Affiliation | Academia | Jin Su¹,², Chenchen Han², Yuyang Zhou², Junjie Shan², Xibin Zhou², Fajie Yuan² (Zhejiang University¹, Westlake University²)
Pseudocode | No | The paper describes algorithms and methods but does not provide pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | We have made the code, pretrained model, and all relevant materials available at https://github.com/westlake-repl/SaProt.
Open Datasets | Yes | We adopt the ProteinGym (Notin et al., 2022) benchmark and ClinVar (Landrum et al., 2018) dataset... We download all AF2 structures based on UniProt IDs. After the release of the AlphaFold DB (Varadi et al., 2021), the majority of predicted structures are now accessible through searching UniProt IDs. (A download sketch follows the table.)
Dataset Splits | Yes | With the exception of the Metal Ion Binding and DeepLoc tasks, we utilize the official data split in the related benchmark literature (TAPE (Rao et al., 2019), PEER (Xu et al., 2022) and FLIP (Dallago et al., 2021)), which includes separate training, validation, and testing sets.
Hardware Specification | Yes | Its training lasted 3 months and utilized 64 NVIDIA 80G A100 GPUs.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify versions for any ancillary software such as Python, PyTorch, or specific libraries.
Experiment Setup | Yes | Specifically, we employ the AdamW optimizer (Loshchilov & Hutter, 2017), setting β1 = 0.9, β2 = 0.98 and we utilized L2 weight decay of 0.01. We gradually increase the learning rate from 0 to 4e-4 over the first 2000 steps and linearly lower it to 5e-4 from 150K steps to 1.5M steps. The overall training phase lasts approximately 3M steps. To deal with long sequences, we truncate them to a maximum of 1024 tokens, and our batch size consists of 512 sequences. Additionally, we also employ mixed precision training to train SaProt. (A configuration sketch follows the table.)
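
The Open Datasets row above notes that the AF2 structures were retrieved from the AlphaFold DB by UniProt accession. The sketch below shows one minimal way to do that; it is not taken from the SaProt repository, and the file-naming pattern (model version "v4") reflects the public AlphaFold DB layout, which may change.

```python
# Minimal sketch (not from the SaProt codebase): download an AlphaFold2-predicted
# structure from the AlphaFold DB by UniProt accession. The URL pattern and the
# "v4" model version follow the public AlphaFold DB file layout at time of writing.
import urllib.request

AFDB_URL = "https://alphafold.ebi.ac.uk/files/AF-{uid}-F1-model_v4.pdb"

def download_af2_structure(uniprot_id: str, out_path: str) -> None:
    """Fetch the predicted structure for a single UniProt accession."""
    urllib.request.urlretrieve(AFDB_URL.format(uid=uniprot_id), out_path)

if __name__ == "__main__":
    # Any valid UniProt accession works here; P68871 (human hemoglobin subunit
    # beta) is used purely as an example.
    download_af2_structure("P68871", "AF-P68871-F1-model_v4.pdb")
```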
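
The Experiment Setup row quotes the optimizer and learning-rate schedule used for pretraining. Below is a hedged PyTorch sketch of that configuration; it is illustrative rather than the authors' script, `model` is a stand-in for the SaProt network, and `FINAL_LR` is an assumed placeholder for the decay endpoint reported in the paper.

```python
# Illustrative sketch only (not the authors' training code): AdamW with a linear
# warmup followed by a linear decay, matching the quoted hyperparameters where
# they are given. FINAL_LR is an assumption, not a value from the paper quote.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

PEAK_LR = 4e-4          # reached at the end of warmup
WARMUP_STEPS = 2_000    # linear warmup from 0
DECAY_START = 150_000   # linear decay begins here
DECAY_END = 1_500_000   # ...and ends here
FINAL_LR = 4e-5         # placeholder: substitute the endpoint reported in the paper

def lr_lambda(step: int) -> float:
    """Multiplicative factor applied to PEAK_LR at each optimizer step."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    if step < DECAY_START:
        return 1.0
    if step < DECAY_END:
        frac = (step - DECAY_START) / (DECAY_END - DECAY_START)
        return 1.0 + frac * (FINAL_LR / PEAK_LR - 1.0)
    return FINAL_LR / PEAK_LR

model = torch.nn.Linear(8, 8)  # stand-in for the SaProt model
optimizer = AdamW(model.parameters(), lr=PEAK_LR,
                  betas=(0.9, 0.98), weight_decay=0.01)
scheduler = LambdaLR(optimizer, lr_lambda)
# Mixed-precision training would additionally wrap the forward/backward pass in
# torch.cuda.amp.autocast() with a GradScaler; per the quoted setup, sequences
# are truncated to 1024 tokens and batched 512 at a time.
```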