Structure-informed Language Models Are Protein Designers

Authors: Zaixiang Zheng, Yifan Deng, Dongyu Xue, Yi Zhou, Fei Ye, Quanquan Gu

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that LM-DESIGN improves the state-of-the-art results by a large margin, leading to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65%/56.63% on the CATH 4.2/4.3 single-chain benchmarks, and >60% when designing protein complexes). (A toy sequence-recovery computation is sketched after the table.)
Researcher Affiliation | Collaboration | 1) ByteDance Research; 2) Dept. of Computer Science, University of Wisconsin-Madison (the work was done during Yifan's internship at ByteDance Research).
Pseudocode | Yes | Figure 11 illustrates the paper's instantiation of the structural adapter. (A hedged sketch of such an adapter follows after the table.)
Open Source Code | No | The paper states that ESM-1b and ProteinMPNN are "openly accessible", but it does not give an explicit statement or link for the open-source code of LM-DESIGN, the method introduced in the paper.
Open Datasets | Yes | We mainly compared LM-DESIGN against recent strong baselines on the CATH 4.2 (Orengo et al., 1997) dataset, using the same data splits as the compared systems... To compare with ESM-IF (Hsu et al., 2022), we also conducted evaluations on CATH 4.3... UniRef50 (Suzek et al., 2015)... SWISS-PROT (Boeckmann et al., 2003) in our experiment.
Dataset Splits | Yes | Proteins were partitioned by the CATH 4.2 topology classification, resulting in 18024 proteins for training, 608 for validation, and 1120 for testing. For CATH 4.3, 16153 structures are assigned to the training set, 1457 to the validation set, and 1797 to the test set.
Hardware Specification | Yes | The models were trained for up to 100 epochs by default using the Adam optimizer on NVIDIA V100s.
Software Dependencies | No | The paper mentions several software tools and models (e.g., the Adam optimizer, AlphaFold2, MMseqs2, DSSP, Biopython), but it does not give version numbers for these dependencies, which are needed for exact reproducibility.
Experiment Setup | Yes | The models were trained for up to 100 epochs by default using the Adam optimizer on NVIDIA V100s, with the same training settings as ProteinMPNN (Dauparas et al., 2022): a batch size of approximately 6000 residues and the Adam optimizer (Kingma & Ba, 2015) with the Noam learning-rate scheduler (Vaswani et al., 2017). (A training-schedule sketch follows after the table.)
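
Sequence recovery, the headline metric in the Research Type row, is conventionally the fraction of designed residues that match the native sequence at the same backbone positions, averaged over the test set. The snippet below is a minimal illustration of that convention, not code from the paper; the function name and toy sequences are ours.

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue matches the native one.

    Both arguments are one-letter amino-acid strings of equal length,
    since the design is conditioned on the native backbone.
    """
    assert len(designed) == len(native), "design and native must be aligned"
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Toy usage: recovery is reported as a percentage averaged over the test set.
print(f"{sequence_recovery('MKTAYIAK', 'MKTAYLAK'):.2%}")  # 87.50%
```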
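
For the Pseudocode row, Figure 11 of the paper illustrates the structural adapter. The sketch below is a hedged reconstruction under our own assumptions, not the authors' implementation: it assumes the adapter cross-attends from the frozen language model's hidden states (queries) to per-residue features from a structure encoder (keys/values), followed by a small feed-forward block. All module names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn


class StructuralAdapter(nn.Module):
    """Hypothetical sketch: inject structure features into a pLM layer via cross-attention."""

    def __init__(self, d_lm: int = 1280, d_struct: int = 128, n_heads: int = 8):
        super().__init__()
        self.struct_proj = nn.Linear(d_struct, d_lm)       # map structure features to LM width
        self.cross_attn = nn.MultiheadAttention(d_lm, n_heads, batch_first=True)
        self.ffn = nn.Sequential(                          # bottleneck feed-forward block
            nn.Linear(d_lm, d_lm // 2), nn.GELU(), nn.Linear(d_lm // 2, d_lm)
        )
        self.norm1 = nn.LayerNorm(d_lm)
        self.norm2 = nn.LayerNorm(d_lm)

    def forward(self, lm_states: torch.Tensor, struct_feats: torch.Tensor) -> torch.Tensor:
        # lm_states:    (B, L, d_lm)     hidden states from the frozen language model
        # struct_feats: (B, L, d_struct) per-residue features from a structure encoder
        kv = self.struct_proj(struct_feats)
        attn_out, _ = self.cross_attn(self.norm1(lm_states), kv, kv)
        x = lm_states + attn_out                           # residual around cross-attention
        return x + self.ffn(self.norm2(x))                 # residual around the FFN
```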
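
The Experiment Setup row quotes the Adam optimizer paired with the Noam learning-rate schedule of Vaswani et al. (2017), trained for up to 100 epochs with batches of roughly 6000 residues. Below is a minimal sketch of that schedule; the warmup length, model width, Adam betas/eps, and the placeholder model are assumptions, not values reported in the excerpt above.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR


def noam_lambda(d_model: int = 512, warmup: int = 4000):
    """Noam schedule: lr_factor = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    def fn(step: int) -> float:
        step = max(step, 1)  # avoid division by zero on the first call
        return (d_model ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)
    return fn


# Placeholder model; only the optimizer/scheduler pairing mirrors the quoted setup.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = LambdaLR(optimizer, lr_lambda=noam_lambda())

for step in range(1, 10001):        # one optimizer step per ~6000-residue batch
    optimizer.zero_grad()
    # ... forward / backward on a batch of ~6000 residues ...
    optimizer.step()
    scheduler.step()
```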