Diffusion Language Models Are Versatile Protein Learners
Authors: Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DPLM on extensive generative and understanding tasks, spanning unconditional generation (§4.1), a variety of protein predictive downstream tasks (§4.2), and conditional tasks, including motif-scaffolding (§4.3.1), the inverse folding task (§4.3.2), and secondary-structure-guided controllable generation (§4.3.3). |
| Researcher Affiliation | Collaboration | Dept. of Computer Science, Nanjing University (this work was done during Xinyou's internship at ByteDance Research); ByteDance Research. |
| Pseudocode | Yes | Algorithm 1: Sampling from RDM (see the hedged sampling sketch after this table). |
| Open Source Code | No | The paper does not provide an explicit statement of open-source code release for DPLM or a direct link to its repository. |
| Open Datasets | Yes | The pre-training procedure for DPLM utilizes the UniRef50 database (Suzek et al., 2015), which comprises around 45 million protein sequences, totaling about 14 billion amino acid tokens. |
| Dataset Splits | No | The paper mentions pre-training on UniRef50 and fine-tuning on various downstream tasks (Thermostability, Metal Ion Binding, DeepLoc, EC, GO, HumanPPI), which are often standard benchmarks, but it does not explicitly provide the training, validation, and test splits used for these experiments within the paper's text. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components and models such as 'ESMFold' and 'OmegaFold' that were used, but it does not provide specific version numbers for these or for other software dependencies required to reproduce the experiments. |
| Experiment Setup | Yes | We train all models for 100K updates, with a batch size of 320K for the 150M model and 1M for the 650M/3B models. |
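
For orientation, the pseudocode row above refers to sampling from a reparameterized discrete diffusion model (RDM). The sketch below illustrates the generic mask-then-unmask sampling loop such models use; the names (`sample_rdm_style`, `denoiser`, `MASK_ID`, `VOCAB_SIZE`, `num_steps`) are illustrative assumptions and do not reproduce the paper's Algorithm 1 or its released code.

```python
# Minimal sketch of mask-based discrete diffusion sampling (assumed names throughout).
import torch

MASK_ID = 0       # assumed index of the [MASK] token
VOCAB_SIZE = 33   # assumed amino-acid vocabulary size (incl. special tokens)


@torch.no_grad()
def sample_rdm_style(denoiser, length: int, num_steps: int = 100):
    """Iteratively unmask a fully-masked sequence of the given length.

    At each step the denoiser predicts logits for every position; the
    lowest-confidence positions are re-masked so that roughly
    length / num_steps new tokens are committed per step.
    """
    x = torch.full((1, length), MASK_ID, dtype=torch.long)
    for t in range(num_steps, 0, -1):
        logits = denoiser(x)                 # (1, length, VOCAB_SIZE)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)       # per-position confidence and prediction
        # Number of positions that should remain masked after this step.
        n_remask = int(length * (t - 1) / num_steps)
        x = pred.clone()
        if n_remask > 0:
            # Re-mask the least confident positions and revisit them next step.
            remask_idx = conf.topk(n_remask, largest=False).indices
            x[0, remask_idx[0]] = MASK_ID
    return x


if __name__ == "__main__":
    # Stand-in denoiser returning random logits, only to make the sketch runnable.
    dummy = lambda seq: torch.randn(seq.shape[0], seq.shape[1], VOCAB_SIZE)
    print(sample_rdm_style(dummy, length=64, num_steps=8))
```

The re-masking schedule here is a simple linear one; the paper's Algorithm 1 may use a different noise schedule and confidence criterion.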