Diffusion Language Models Are Versatile Protein Learners

Authors: Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DPLM on extensive generative and understanding tasks, spanning unconditional generation (§4.1), a variety of protein predictive downstream tasks (§4.2), and conditional tasks, including motif scaffolding (§4.3.1), inverse folding (§4.3.2), and secondary-structure-guided controllable generation (§4.3.3).
Researcher Affiliation | Collaboration | Dept. of Computer Science, Nanjing University (this work was done during Xinyou's internship at ByteDance Research) and ByteDance Research.
Pseudocode | Yes | Algorithm 1: Sampling from RDM (a hedged sketch of this kind of sampler follows the table).
Open Source Code | No | The paper does not explicitly state that the code for DPLM is open-sourced, nor does it provide a direct link to a repository.
Open Datasets | Yes | The pre-training procedure for DPLM utilizes the UniRef50 database (Suzek et al., 2015), which comprises around 45 million protein sequences, totaling about 14 billion amino acid tokens.
Dataset Splits | No | The paper mentions using UniRef50 for pre-training and fine-tuning on various downstream tasks (Thermostability, Metal Ion Binding, DeepLoc, EC, GO, HumanPPI), which are standard benchmarks, but it does not explicitly provide the training, validation, and test splits used for these experiments within the paper's text.
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU models, CPU types) used to run its experiments.
Software Dependencies | No | The paper mentions software components and models such as ESMFold and OmegaFold, but it does not provide version numbers for these or other software dependencies required to reproduce the experiments.
Experiment Setup | Yes | We train all models for 100K updates, with a batch size of 320K tokens for the 150M model and 1M tokens for the 650M/3B models (summarized as a config sketch below).
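
The Pseudocode row points to Algorithm 1, sampling from RDM (the reparameterized discrete diffusion model that DPLM builds on). The paper's pseudocode is not reproduced on this page, so the following is only a minimal sketch of the general mask-predict denoising loop that such samplers follow. The `model(tokens)` interface returning per-position logits, the `mask_id` absorbing token, and the linear unmasking schedule are all assumptions for illustration, not the authors' exact algorithm.

```python
import torch

@torch.no_grad()
def sample_rdm(model, length, mask_id, steps=100, device="cpu"):
    """Sketch of iterative denoising from a fully masked sequence.

    Assumes a hypothetical `model(tokens)` that maps a (1, length) token
    tensor to (1, length, vocab_size) logits, with `mask_id` as the
    absorbing-state token.
    """
    # Start from the absorbing-state prior: every position is masked.
    tokens = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    for t in range(steps):
        logits = model(tokens)                  # (1, length, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)          # per-position confidence, proposal
        is_masked = tokens == mask_id
        # Linear schedule (an assumption): after step t, commit this many
        # positions in total, revealing the most confident predictions first.
        target = (length * (t + 1)) // steps
        n_reveal = target - int((~is_masked).sum())
        if n_reveal > 0:
            # Only still-masked positions compete; committed tokens stay fixed.
            conf = conf.masked_fill(~is_masked, float("-inf"))
            idx = conf.topk(n_reveal, dim=-1).indices
            tokens = tokens.scatter(1, idx, pred.gather(1, idx))
    return tokens
```

The confidence-ranked unmasking here is one common choice for discrete diffusion samplers; the actual RDM algorithm uses a reparameterized transition that this sketch does not attempt to reproduce.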
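For the Experiment Setup row, the reported schedule can be summarized as a small config mapping. Only the update count and batch sizes come from the paper; the dict keys are illustrative model names, and reading the batch sizes as token counts is an inference from the paper's token-count framing of UniRef50.

```python
# Reported DPLM pre-training schedule. Keys are illustrative names;
# batch sizes are read as token counts (an assumption).
PRETRAIN_SETUP = {
    "dplm-150m": {"updates": 100_000, "batch_size_tokens": 320_000},
    "dplm-650m": {"updates": 100_000, "batch_size_tokens": 1_000_000},
    "dplm-3b":   {"updates": 100_000, "batch_size_tokens": 1_000_000},
}
```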