Structure-informed Language Models Are Protein Designers
Authors: Zaixiang Zheng, Yifan Deng, Dongyu Xue, Yi Zhou, Fei Ye, Quanquan Gu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that LM-DESIGN improves the state-of-the-art results by a large margin, leading to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65%/56.63% on CATH 4.2/4.3 single-chain benchmarks, and >60% when designing protein complexes). A minimal sketch of the sequence-recovery metric appears after this table. |
| Researcher Affiliation | Collaboration | ¹ByteDance Research; ²Dept. of Computer Science, University of Wisconsin-Madison (work was done during Yifan's internship at ByteDance Research). |
| Pseudocode | Yes | Figure 11. Illustration of our instantiation of the structural adapter |
| Open Source Code | No | The paper states that ESM-1b and ProteinMPNN are "openly accessible", but it provides no explicit statement of, or link to, open-source code for LM-DESIGN, the method described in this paper. |
| Open Datasets | Yes | We mainly compared LM-DESIGN against recent strong baselines on the CATH 4.2 (Orengo et al., 1997) dataset, using the same data splits as the compared systems... To compare with ESM-IF (Hsu et al., 2022), we also conducted evaluations on CATH 4.3... UniRef50 (Suzek et al., 2015)... SWISS-PROT (Boeckmann et al., 2003) in our experiment. |
| Dataset Splits | Yes | where proteins were partitioned by the CATH 4.2 topology classification, resulting in 18024 proteins for training, 608 proteins for validation, and 1120 proteins for testing. CATH 4.3, wherein 16153 structures are assigned to the training set, 1457 to the validation set, and 1797 to the test set. |
| Hardware Specification | Yes | The models were trained up to 100 epochs by default using the Adam optimizer on NVIDIA V100s. |
| Software Dependencies | No | The paper mentions several software tools and models used (e.g., "Adam optimizer", "AlphaFold2", "MMseqs2", "DSSP", "Biopython"), but it does not specify explicit version numbers for these software dependencies, which are crucial for reproducibility. |
| Experiment Setup | Yes | The models were trained up to 100 epochs by default using the Adam optimizer on NVIDIA V100s. We used the same training settings as ProteinMPNN (Dauparas et al., 2022), where the batch size was set to approximately 6000 residues, and the Adam optimizer (Kingma & Ba, 2015) with the Noam learning rate scheduler (Vaswani et al., 2017) was used. A sketch of the Noam schedule follows the table. |
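
The recurring metric in this table is sequence recovery: the fraction of designed residues that match the native sequence at the same positions. As a rough illustration of how such a number is computed, here is a minimal Python sketch; the function name and example sequences are hypothetical and are not taken from the LM-DESIGN codebase.

```python
# Minimal sketch of the sequence-recovery metric (hypothetical helper,
# not from the LM-DESIGN codebase): the fraction of positions where the
# designed residue matches the native one.
def sequence_recovery(designed: str, native: str) -> float:
    assert len(designed) == len(native), "sequences must be the same length"
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# Example: 3 of 4 residues match the native sequence -> 0.75
print(sequence_recovery("MKVL", "MKVI"))  # 0.75
```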
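
The experiment-setup row cites the Noam learning rate scheduler from Vaswani et al. (2017), which warms the learning rate up linearly and then decays it proportionally to the inverse square root of the step count. Below is a minimal sketch of that schedule; the `d_model` and `warmup_steps` defaults are the illustrative values from the Transformer paper, not hyperparameters reported for LM-DESIGN.

```python
# Minimal sketch of the Noam learning-rate schedule (Vaswani et al., 2017):
# lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
# The defaults below are illustrative, not LM-DESIGN's actual settings.
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    step = max(step, 1)  # guard against division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The learning rate peaks when step == warmup_steps:
print(noam_lr(4000))  # ~7.0e-4
```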