SaProt: Protein Language Modeling with Structure-aware Vocabulary
Authors: Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, Fajie Yuan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive evaluation, our SaProt model surpasses well-established and renowned baselines across 10 significant downstream tasks, demonstrating its exceptional capacity and broad applicability. We evaluate SaProt across 10 diverse downstream tasks, encompassing residue-level and protein-level tasks. We conduct a series of enlightening ablation studies, unveiling previously unknown findings. |
| Researcher Affiliation | Academia | Jin Su¹,², Chenchen Han², Yuyang Zhou², Junjie Shan², Xibin Zhou², Fajie Yuan²; Zhejiang University¹, Westlake University² |
| Pseudocode | No | The paper describes algorithms and methods but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | We have made the code, pretrained model, and all relevant materials available at https://github.com/westlake-repl/SaProt. |
| Open Datasets | Yes | We adopt the ProteinGym (Notin et al., 2022) benchmark and ClinVar (Landrum et al., 2018) dataset... We download all AF2 structures based on UniProt IDs. After the release of the AlphaFold DB (Varadi et al., 2021), the majority of predicted structures are now accessible through searching UniProt IDs. *(See the structure-retrieval sketch below the table.)* |
| Dataset Splits | Yes | With the exception of the Metal Ion Binding and DeepLoc tasks, we utilize the official data split in the related benchmark literature (TAPE (Rao et al., 2019), PEER (Xu et al., 2022) and FLIP (Dallago et al., 2021)), which includes separate training, validation, and testing sets. |
| Hardware Specification | Yes | Its training lasted 3 months and utilized 64 NVIDIA 80G A100 GPUs |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify versions for any ancillary software like Python, PyTorch, or specific libraries. |
| Experiment Setup | Yes | Specifically, we employ the AdamW optimizer (Loshchilov & Hutter, 2017), setting β1 = 0.9, β2 = 0.98 and we utilized L2 weight decay of 0.01. We gradually increase the learning rate from 0 to 4e-4 over the first 2000 steps and linearly lower it to 5e-4 from 150K steps to 1.5M steps. The overall training phase lasts approximately 3M steps. To deal with long sequences, we truncate them to a maximum of 1024 tokens, and our batch size consists of 512 sequences. Additionally, we also employ mixed precision training to train SaProt. |
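
To make the Experiment Setup row concrete, the sketch below expresses the quoted optimizer and schedule in PyTorch. It is a minimal illustration, not the authors' released training code: the model is a stand-in module, and the end-of-decay learning rate is an assumption (the quoted target of 5e-4 would exceed the 4e-4 peak, so a smaller value is used here).

```python
# Minimal sketch of the quoted optimization recipe (not the authors' code).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

PEAK_LR = 4e-4
WARMUP_STEPS = 2_000
DECAY_START, DECAY_END = 150_000, 1_500_000
FINAL_LR = 4e-5          # assumption: the excerpt's 5e-4 end value exceeds the 4e-4 peak
MAX_TOKENS = 1_024       # sequences truncated to a maximum of 1024 tokens
BATCH_SIZE = 512         # sequences per batch

model = torch.nn.Linear(MAX_TOKENS, 2)  # stand-in for the SaProt transformer

optimizer = AdamW(model.parameters(), lr=PEAK_LR, betas=(0.9, 0.98), weight_decay=0.01)

def lr_lambda(step: int) -> float:
    """Linear warmup over 2K steps, hold at the peak, then linear decay from 150K to 1.5M steps."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    if step < DECAY_START:
        return 1.0
    if step < DECAY_END:
        frac = (step - DECAY_START) / (DECAY_END - DECAY_START)
        return 1.0 + frac * (FINAL_LR / PEAK_LR - 1.0)
    return FINAL_LR / PEAK_LR

scheduler = LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()  # mixed precision training, as stated in the excerpt

def training_step(inputs: torch.Tensor, labels: torch.Tensor) -> float:
    """One mixed-precision update; `inputs` is (batch, MAX_TOKENS) floats, `labels` is (batch,) ints."""
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        logits = model(inputs)
        loss = torch.nn.functional.cross_entropy(logits, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```

`LambdaLR` scales the peak learning rate by a step-dependent factor, which is a common way to encode a warmup-then-linear-decay schedule; the 3M-step total and the 64×A100 setup from the Hardware row are outside the scope of this sketch.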
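
The Open Datasets row notes that AF2 structures were downloaded by UniProt ID after the release of the AlphaFold DB. The snippet below illustrates one way to fetch such a predicted structure; the file URL pattern and the "v4" model-version suffix reflect the public AlphaFold DB file service and are assumptions rather than details given in the paper, and `download_af2_structure` is a hypothetical helper.

```python
# Minimal sketch: fetch an AlphaFold-predicted structure for a given UniProt ID.
# The URL pattern and "v4" version suffix are assumptions about the public
# AlphaFold DB file service, not something specified in the paper.
import urllib.request
from pathlib import Path

def download_af2_structure(uniprot_id: str, out_dir: str = "af2_structures", version: int = 4) -> Path:
    """Download the predicted PDB file for `uniprot_id` from the AlphaFold DB."""
    url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v{version}.pdb"
    out_path = Path(out_dir) / f"AF-{uniprot_id}.pdb"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(out_path))
    return out_path

if __name__ == "__main__":
    # Example UniProt ID (human hemoglobin subunit beta), used purely for illustration.
    print(download_af2_structure("P68871"))
```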