Bridge-IF: Learning Inverse Protein Folding with Markov Bridges

Authors: Yiheng Zhu, Jialu Wu, Qiuyi Li, Jiahuan Yan, Mingze Yin, Wei Wu, Mingyang Li, Jieping Ye, Zheng Wang, Jian Wu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on well-established benchmarks demonstrate that Bridge-IF predominantly surpasses existing baselines in sequence recovery and excels in the design of plausible proteins with high foldability.
Researcher Affiliation | Collaboration | (1) College of Computer Science & Technology and Liangzhu Laboratory, Zhejiang University; (2) College of Pharmaceutical Sciences, Zhejiang University; (3) Alibaba Cloud Computing; (4) School of Public Health, Zhejiang University; (5) School of Artificial Intelligence and Data Science, University of Science and Technology of China; (6) The Second Affiliated Hospital Zhejiang University School of Medicine. {zhuyiheng2020, jialuwu, jyansir, yinmingze, wujian2000}@zju.edu.cn; {liqiuyi.lqy, sangheng.lmy, yejieping.ye, wz388779}@alibaba-inc.com; urara@mail.ustc.edu.cn
Pseudocode | Yes | The overall workflow of the training and sampling process are provided in Algorithm 1 and Algorithm 2.
Open Source Code | Yes | The code is available at https://github.com/violet-sto/Bridge-IF.
Open Datasets | Yes | We conduct experiments on both CATH v4.2 and CATH v4.3, where proteins are categorized based on the CATH hierarchical classification of protein structure, to ensure a comprehensive analysis. Following the standard data splitting provided by Ingraham et al. [25], CATH v4.2 dataset consists of 18,024 proteins for training, 608 proteins for validation, and 1,120 proteins for testing. Following the standard data splitting provided by Hsu et al. [22], CATH v4.3 dataset consists of 16,153 proteins for training, 1,457 proteins for validation, and 1,797 proteins for testing.
Dataset Splits | Yes | Following the standard data splitting provided by Ingraham et al. [25], CATH v4.2 dataset consists of 18,024 proteins for training, 608 proteins for validation, and 1,120 proteins for testing. Following the standard data splitting provided by Hsu et al. [22], CATH v4.3 dataset consists of 16,153 proteins for training, 1,457 proteins for validation, and 1,797 proteins for testing.
Hardware Specification | Yes | The model is trained up to 50 epochs by default on an NVIDIA 3090.
Software Dependencies | No | The paper mentions using specific models like ESM-1b and ESM-2, and optimizers like Adam, but does not provide specific version numbers for software libraries or dependencies.
Experiment Setup | Yes | We use the cosine schedule [39] with number of timestep T = 25. The model is trained up to 50 epochs by default on an NVIDIA 3090. We used the same training settings as ProteinMPNN [6], where the batch size was set to approximately 6000 residues, and Adam optimizer [30] with Noam learning rate scheduler [51] was used.
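
The experiment setup names two mechanisms that are easy to misread from prose alone: a Noam learning rate schedule and a batch size measured in residues rather than in proteins. The sketch below illustrates both in plain Python. It is not taken from the Bridge-IF repository. The function names, the `d_model=128` and `warmup_steps=4000` defaults, and the choice to count padded residues (longest protein times batch count) are assumptions made for illustration; only the 6000-residue budget comes from the text above.

```python
from typing import List, Sequence


def noam_lr(step: int, d_model: int = 128, warmup_steps: int = 4000,
            factor: float = 1.0) -> float:
    """Noam schedule: linear warmup, then inverse-square-root decay.

    d_model, warmup_steps, and factor are placeholder values (assumptions),
    not settings reported in the summary above.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the first update
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup_steps ** -1.5)


def batch_by_residues(lengths: Sequence[int],
                      max_residues: int = 6000) -> List[List[int]]:
    """Greedily group protein indices under a padded-residue budget.

    Proteins are visited from shortest to longest, so the protein being
    added is always the longest in the batch and sets the padded length.
    A protein longer than the budget ends up in a batch of its own.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        padded_size = lengths[idx] * (len(current) + 1)
        if current and padded_size > max_residues:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical protein lengths, chosen only to exercise the batching.
    example_lengths = [100, 3000, 2900, 50, 6000, 200]
    print(batch_by_residues(example_lengths))
    print([round(noam_lr(s), 6) for s in (1, 2000, 4000, 8000)])
```

Sorting by length before batching keeps padding low, which is why a residue budget tends to give more uniform memory use than a fixed number of proteins per batch. The learning rate peaks at `warmup_steps` and decays afterwards.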