Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Protein Design with Dynamic Protein Vocabulary

Authors: Nuowei Liu, Jiahao Kuang, Yanting Liu, Tao Ji, Changzhi Sun, Man Lan, Yuanbin Wu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce PRODVA... Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible.
Researcher Affiliation	Collaboration	1 School of Computer Science and Technology, East China Normal University 2 English Department, College of Foreign Languages and Literatures, Fudan University 3 Institute of Artificial Intelligence (Tele AI), China Telecom
Pseudocode	No	The paper describes the model architecture and training process in detail through text and diagrams, but it does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code	Yes	Datasets and codes are publicly available at https://github.com/sornkL/ProDVa.
Open Datasets	Yes	Datasets and codes are publicly available at https://github.com/sornkL/ProDVa. ... CAMEO targets released between May 1, 2020 and August 1, 2023 [36]. ... We utilize the Mol-Instructions [37] protein design-related instructions as the test set.
Dataset Splits	Yes	We randomly sample 5% of this dataset as the validation set, with the remaining used for training. ... Additionally, we allocate 10% of the Mol-Instructions protein design-related training set for validation.
Hardware Specification	No	The computations in this research were performed using the CFFF platform of Fudan University. No specific hardware (GPU/CPU models, memory) details are provided.
Software Dependencies	No	The Text Language Model is initialized with GPT-2 [22]... Both the Protein Language Model and the Fragment Encoder are initialized with ProtGPT2 [23]... We employ PubMedBERT [24] as the embedding model... We employ the txtai framework... supported by the Faiss [25] backend. ... AdamW optimizer [53]. No specific version numbers for these software components are provided.
Experiment Setup	Yes	PRODVA is trained using the AdamW optimizer [53] with β1 = 0.9, β2 = 0.95, and a gradient clipping of 1.0. The maximum learning rate is set to 1 × 10⁻⁴, with a linear warmup over the first 5% of training steps. ... The minibatch size is set to 4 to ensure memory efficiency, with an overall batch size of 64. During training, the weights of the Text Language Model are frozen. Training is conducted for 10K steps on the CAMEO subset and 20K steps on Mol-Instructions. The loss weights are set to α = β = 0.2.
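
The arithmetic implied by the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the maximum learning rate is read as 1e-4, the overall batch size of 64 is assumed to come from gradient accumulation on a single device, and the learning rate is assumed to stay constant after warmup because the excerpt does not state the post-warmup schedule.

```python
# Values quoted in the Experiment Setup row.
MAX_LR = 1e-4            # maximum learning rate (read as 1 x 10^-4)
WARMUP_FRACTION = 0.05   # linear warmup over the first 5% of steps
MINIBATCH_SIZE = 4
OVERALL_BATCH_SIZE = 64


def accumulation_steps(overall_batch: int, minibatch: int, num_devices: int = 1) -> int:
    """Minibatches to accumulate per optimizer step.

    Assumption: the overall batch is reached by gradient accumulation;
    num_devices = 1 is a guess, since the hardware is not reported.
    """
    per_step = minibatch * num_devices
    if overall_batch % per_step != 0:
        raise ValueError("overall batch must be a multiple of minibatch * num_devices")
    return overall_batch // per_step


def warmup_steps(total_steps: int, fraction: float = WARMUP_FRACTION) -> int:
    """Number of optimizer steps spent in linear warmup."""
    return int(round(total_steps * fraction))


def learning_rate(step: int, total_steps: int, max_lr: float = MAX_LR) -> float:
    """Learning rate at a 0-indexed step.

    Linear warmup to max_lr, then constant (assumption: the schedule
    after warmup is not given in the excerpt).
    """
    warmup = warmup_steps(total_steps)
    if warmup > 0 and step < warmup:
        return max_lr * (step + 1) / warmup
    return max_lr


if __name__ == "__main__":
    print("accumulation steps:", accumulation_steps(OVERALL_BATCH_SIZE, MINIBATCH_SIZE))
    for name, total in [("CAMEO subset", 10_000), ("Mol-Instructions", 20_000)]:
        print(name, "warmup steps:", warmup_steps(total))
```

Under these assumptions, each optimizer step accumulates 16 minibatches, and warmup covers 500 of the 10K CAMEO steps and 1,000 of the 20K Mol-Instructions steps. With more than one device, the accumulation count shrinks accordingly.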