Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Elucidating the Design Space of Multimodal Protein Language Models
Authors: Cheng-Yen Hsieh, Xinyou Wang, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, Quanquan Gu
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve structure generation diversity and, notably, the folding abilities of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, even outperforming 3B baselines and performing on par with specialized folding models. |
| Researcher Affiliation | Collaboration | School of Computer Science, Nanjing University; Dept. of ECE, Rutgers University; ByteDance Seed (this work was done during Xinyou Wang and Daiheng Zhang's internship at ByteDance Seed). Correspondence to: Quanquan Gu <EMAIL>. |
| Pseudocode | No | The paper includes architectural diagrams in figures (e.g., Figure 1, 2, 3, 4) and describes procedures in paragraph text, but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page and code: bytedance.github.io/dplm/dplm-2.1. |
| Open Datasets | Yes | Our design methods allow multimodal PLMs to achieve robust structural understanding, improving the folding RMSD from 5.52 to 2.36 on the PDB date split, outperforming 3B folding baselines with only 650M parameters. We evaluate the structure prediction performance on the folding task. As shown in Table 4, the residual diffusion module is capable of improving structural prediction accuracy by refining fine-grained structural variations based on language model predictions. Moreover, we observe that the residual diffusion module is model-agnostic, showing consistent performance improvements across different DPLM-2 variants. Fig. 7 demonstrates that the residual diffusion module performs fine-grained refinements on the local structure, optimizing interatomic distances to facilitate the formation of plausible secondary structures. Specifically, we select tokenizers from DPLM-2 and ESM3, training separate DPLM-2 variants with the same architecture but using their respective structure token codebooks. These models are evaluated on the CAMEO 2022 test set for both reconstruction and protein folding performance. |
| Dataset Splits | Yes | Our design methods allow multimodal PLMs to achieve robust structural understanding, improving the folding RMSD from 5.52 to 2.36 on the PDB date split, outperforming 3B folding baselines with only 650M parameters. Specifically, we select tokenizers from DPLM-2 and ESM3, training separate DPLM-2 variants with the same architecture but using their respective structure token codebooks. These models are evaluated on the CAMEO 2022 test set for both reconstruction and protein folding performance. PDB date split. PDB-Multimer: 11614/291, 2.88, 1.66, 661.57, 416.37, 229.39, 167.00 (split statistics quoted from a table in the paper; column headers are not preserved in the extract). We excluded multi-chain proteins with lengths outside the range of [60, 512], resulting in 3,462 training samples in PDB-Multimer. |
| Hardware Specification | Yes | We report the training time of DPLM-2 variants with either bit-based modeling or hybrid approach on 16 H100s for 300k training steps in Table 19. |
| Software Dependencies | No | The paper mentions software components such as nn.Parameter, Python, PyTorch, and CUDA, but does not provide specific version numbers for these dependencies, which are required for full reproducibility. |
| Experiment Setup | Yes | During training, the learning rate is warmed up over the first 2,000 steps to a peak value of 1 × 10^−4 and then linearly decayed to 1 × 10^−5. We train the residual diffusion module for 100,000 steps with a batch size of 240. We employ 2,000 warmup steps until reaching the maximum learning rate of 1 × 10^−4, and use a linear decay schedule to reduce the learning rate to 1 × 10^−5. The overall training process consists of 300,000 steps. |
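The learning-rate schedule quoted in the Experiment Setup row (2,000-step linear warmup to 1 × 10^−4, then linear decay to 1 × 10^−5) can be sketched as a simple step-to-LR function. This is a minimal illustration, not the authors' code: the function name `lr_at_step` is hypothetical, and the assumptions that warmup starts from 0 and that decay ends exactly at the final training step are ours, not stated in the paper.

```python
def lr_at_step(step: int,
               warmup_steps: int = 2_000,
               total_steps: int = 100_000,
               peak_lr: float = 1e-4,
               final_lr: float = 1e-5) -> float:
    """Piecewise-linear schedule: warm up to peak_lr, then decay to final_lr.

    Defaults mirror the residual diffusion module's reported setup
    (100k steps); the main model would use total_steps=300_000.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr over the first warmup_steps.
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr (at warmup_steps) to final_lr (at total_steps).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr + (final_lr - peak_lr) * min(progress, 1.0)
```

In a PyTorch training loop this function could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at_step(step) / peak_lr` as the multiplicative factor.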