Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Elucidating the Design Space of Multimodal Protein Language Models
Authors: Cheng-Yen Hsieh, Xinyou Wang, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, Quanquan Gu
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve structure generation diversity and, notably, the folding abilities of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, even outperforming 3B baselines and performing on par with specialized folding models. |
| Researcher Affiliation | Collaboration | School of Computer Science, Nanjing University; Dept. of ECE, Rutgers University; ByteDance Seed (this work was done during Xinyou Wang and Daiheng Zhang's internship at ByteDance Seed). Correspondence to: Quanquan Gu <EMAIL>. |
| Pseudocode | No | The paper includes architectural diagrams in figures (e.g., Figure 1, 2, 3, 4) and describes procedures in paragraph text, but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page and code: bytedance.github.io/dplm/dplm-2.1. |
| Open Datasets | Yes | Our design methods allow multimodal PLMs to achieve robust structural understanding, improving the folding RMSD from 5.52 to 2.36 on the PDB date split, outperforming 3B folding baselines with only 650M parameters. We evaluate the structure prediction performance on the folding task. As shown in Table 4, the residual diffusion module is capable of improving structural prediction accuracy by refining fine-grained structural variations based on language model predictions. Moreover, we observe that the residual diffusion module is model-agnostic, showing consistent performance improvements across different DPLM-2 variants. Fig. 7 demonstrates that the residual diffusion module performs fine-grained refinements on the local structure, optimizing interatomic distances to facilitate the formation of plausible secondary structures. Specifically, we select tokenizers from DPLM-2 and ESM3, training separate DPLM-2 variants with the same architecture but using their respective structure token codebooks. These models are evaluated on the CAMEO 2022 test set for both reconstruction and protein folding performance. |
| Dataset Splits | Yes | Our design methods allow multimodal PLMs to achieve robust structural understanding, improving the folding RMSD from 5.52 to 2.36 on the PDB date split, outperforming 3B folding baselines with only 650M parameters. Specifically, we select tokenizers from DPLM-2 and ESM3, training separate DPLM-2 variants with the same architecture but using their respective structure token codebooks. These models are evaluated on the CAMEO 2022 test set for both reconstruction and protein folding performance. PDB date split. PDB-Multimer: 11614/291, 2.88, 1.66, 661.57, 416.37, 229.39, 167.00 (split statistics quoted from a table in the paper; column headers are not preserved in the extract). We excluded multi-chain proteins with lengths outside the range of [60, 512], resulting in 3,462 training samples in PDB-Multimer. |
| Hardware Specification | Yes | We report the training time of DPLM-2 variants with either bit-based modeling or hybrid approach on 16 H100s for 300k training steps in Table 19. |
| Software Dependencies | No | The paper mentions software components such as nn.Parameter, Python, PyTorch, and CUDA, but does not provide specific version numbers for these dependencies, which are required for full reproducibility. |
| Experiment Setup | Yes | During training, the learning rate is warmed up over the first 2,000 steps to a peak value of 1 × 10^−4 and then linearly decayed to 1 × 10^−5. We train the residual diffusion module for 100,000 steps with a batch size of 240. We employ 2,000 warmup steps until reaching the maximum learning rate of 1 × 10^−4, and use a linear decay schedule to reduce the learning rate to 1 × 10^−5. The overall training process consists of 300,000 steps. |
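The learning-rate schedule quoted in the Experiment Setup row (2,000-step linear warmup to 1 × 10^−4, then linear decay to 1 × 10^−5) can be sketched as a simple step-to-LR function. This is a minimal illustration, not the authors' code: the function name `lr_at_step` is hypothetical, and the assumptions that warmup starts from 0 and that decay ends exactly at the final training step are ours, not stated in the paper.

```python
def lr_at_step(step: int,
               warmup_steps: int = 2_000,
               total_steps: int = 100_000,
               peak_lr: float = 1e-4,
               final_lr: float = 1e-5) -> float:
    """Piecewise-linear schedule: warm up to peak_lr, then decay to final_lr.

    Defaults mirror the residual diffusion module's reported setup
    (100k steps); the main model would use total_steps=300_000.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr over the first warmup_steps.
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr (at warmup_steps) to final_lr (at total_steps).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr + (final_lr - peak_lr) * min(progress, 1.0)
```

In a PyTorch training loop this function could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at_step(step) / peak_lr` as the multiplicative factor.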