Structure-informed Language Models Are Protein Designers

Authors: Zaixiang Zheng, Yifan Deng, Dongyu Xue, Yi Zhou, Fei Ye, Quanquan Gu

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that LM-DESIGN improves the state-of-the-art results by a large margin, leading to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65%/56.63% on the CATH 4.2/4.3 single-chain benchmarks, and >60% when designing protein complexes). (A toy sequence-recovery computation is sketched after the table.)
Researcher Affiliation | Collaboration | 1) ByteDance Research; 2) Dept. of Computer Science, University of Wisconsin-Madison (the work was done during Yifan's internship at ByteDance Research).
Pseudocode | Yes | Figure 11 illustrates the paper's instantiation of the structural adapter. (A hedged sketch of such an adapter follows after the table.)
Open Source Code | No | The paper states that ESM-1b and ProteinMPNN are "openly accessible", but it does not give an explicit statement or link for the open-source code of LM-DESIGN, the method introduced in the paper.
Open Datasets | Yes | We mainly compared LM-DESIGN against recent strong baselines on the CATH 4.2 (Orengo et al., 1997) dataset, using the same data splits as the compared systems... To compare with ESM-IF (Hsu et al., 2022), we also conducted evaluations on CATH 4.3... UniRef50 (Suzek et al., 2015)... SWISS-PROT (Boeckmann et al., 2003) in our experiment.
Dataset Splits | Yes | Proteins were partitioned by the CATH 4.2 topology classification, resulting in 18024 proteins for training, 608 for validation, and 1120 for testing. For CATH 4.3, 16153 structures are assigned to the training set, 1457 to the validation set, and 1797 to the test set.
Hardware Specification | Yes | The models were trained for up to 100 epochs by default using the Adam optimizer on NVIDIA V100s.
Software Dependencies | No | The paper mentions several software tools and models (e.g., the Adam optimizer, AlphaFold2, MMseqs2, DSSP, Biopython), but it does not give version numbers for these dependencies, which are needed for exact reproducibility.
Experiment Setup | Yes | The models were trained for up to 100 epochs by default using the Adam optimizer on NVIDIA V100s, with the same training settings as ProteinMPNN (Dauparas et al., 2022): a batch size of approximately 6000 residues and the Adam optimizer (Kingma & Ba, 2015) with the Noam learning-rate scheduler (Vaswani et al., 2017). (A training-schedule sketch follows after the table.)
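
Sequence recovery, the headline metric in the Research Type row, is conventionally the fraction of designed residues that match the native sequence at the same backbone positions, averaged over the test set. The snippet below is a minimal illustration of that convention, not code from the paper; the function name and toy sequences are ours.

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue matches the native one.

    Both arguments are one-letter amino-acid strings of equal length,
    since the design is conditioned on the native backbone.
    """
    assert len(designed) == len(native), "design and native must be aligned"
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Toy usage: recovery is reported as a percentage averaged over the test set.
print(f"{sequence_recovery('MKTAYIAK', 'MKTAYLAK'):.2%}")  # 87.50%
```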
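
For the Pseudocode row, Figure 11 of the paper illustrates the structural adapter. The sketch below is a hedged reconstruction under our own assumptions, not the authors' implementation: it assumes the adapter cross-attends from the frozen language model's hidden states (queries) to per-residue features from a structure encoder (keys/values), followed by a small feed-forward block. All module names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn


class StructuralAdapter(nn.Module):
    """Hypothetical sketch: inject structure features into a pLM layer via cross-attention."""

    def __init__(self, d_lm: int = 1280, d_struct: int = 128, n_heads: int = 8):
        super().__init__()
        self.struct_proj = nn.Linear(d_struct, d_lm)       # map structure features to LM width
        self.cross_attn = nn.MultiheadAttention(d_lm, n_heads, batch_first=True)
        self.ffn = nn.Sequential(                          # bottleneck feed-forward block
            nn.Linear(d_lm, d_lm // 2), nn.GELU(), nn.Linear(d_lm // 2, d_lm)
        )
        self.norm1 = nn.LayerNorm(d_lm)
        self.norm2 = nn.LayerNorm(d_lm)

    def forward(self, lm_states: torch.Tensor, struct_feats: torch.Tensor) -> torch.Tensor:
        # lm_states:    (B, L, d_lm)     hidden states from the frozen language model
        # struct_feats: (B, L, d_struct) per-residue features from a structure encoder
        kv = self.struct_proj(struct_feats)
        attn_out, _ = self.cross_attn(self.norm1(lm_states), kv, kv)
        x = lm_states + attn_out                           # residual around cross-attention
        return x + self.ffn(self.norm2(x))                 # residual around the FFN
```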
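
The Experiment Setup row quotes the Adam optimizer paired with the Noam learning-rate schedule of Vaswani et al. (2017), trained for up to 100 epochs with batches of roughly 6000 residues. Below is a minimal sketch of that schedule; the warmup length, model width, Adam betas/eps, and the placeholder model are assumptions, not values reported in the excerpt above.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR


def noam_lambda(d_model: int = 512, warmup: int = 4000):
    """Noam schedule: lr_factor = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    def fn(step: int) -> float:
        step = max(step, 1)  # avoid division by zero on the first call
        return (d_model ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)
    return fn


# Placeholder model; only the optimizer/scheduler pairing mirrors the quoted setup.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = LambdaLR(optimizer, lr_lambda=noam_lambda())

for step in range(1, 10001):        # one optimizer step per ~6000-residue batch
    optimizer.zero_grad()
    # ... forward / backward on a batch of ~6000 residues ...
    optimizer.step()
    scheduler.step()
```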