Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Protein Design with Dynamic Protein Vocabulary
Authors: Nuowei Liu, Jiahao Kuang, Yanting Liu, Tao Ji, Changzhi Sun, Man Lan, Yuanbin Wu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce PRODVA... Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Technology, East China Normal University 2 English Department, College of Foreign Languages and Literatures, Fudan University 3 Institute of Artificial Intelligence (Tele AI), China Telecom |
| Pseudocode | No | The paper describes the model architecture and training process in detail through text and diagrams, but it does not include any explicitly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | Datasets and codes are publicly available at https://github.com/sornkL/ProDVa. |
| Open Datasets | Yes | Datasets and codes are publicly available at https://github.com/sornkL/ProDVa. ... CAMEO targets released between May 1, 2020 and August 1, 2023 [36]. ... We utilize the Mol-Instructions [37] protein design-related instructions as the test set. |
| Dataset Splits | Yes | We randomly sample 5% of this dataset as the validation set, with the remaining used for training. ... Additionally, we allocate 10% of the Mol-Instructions protein design-related training set for validation. |
| Hardware Specification | No | The computations in this research were performed using the CFFF platform of Fudan University. No specific hardware (GPU/CPU models, memory) details are provided. |
| Software Dependencies | No | The Text Language Model is initialized with GPT-2 [22]... Both the Protein Language Model and the Fragment Encoder are initialized with ProtGPT2 [23]... We employ PubMedBERT [24] as the embedding model... We employ the txtai framework... supported by the Faiss [25] backend. ... AdamW optimizer [53]. No specific version numbers for these software components are provided. |
| Experiment Setup | Yes | PRODVA is trained using the AdamW optimizer [53] with β1 = 0.9, β2 = 0.95, and a gradient clipping of 1.0. The maximum learning rate is set to 1 × 10⁻⁴, with a linear warmup over the first 5% of training steps. ... The minibatch size is set to 4 to ensure memory efficiency, with an overall batch size of 64. During training, the weights of the Text Language Model are frozen. Training is conducted for 10K steps on the CAMEO subset and 20K steps on Mol-Instructions. The loss weights are set to α = β = 0.2. |
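
The Experiment Setup row implies a few derived quantities that are easy to misread: the number of warmup steps for each dataset, and how a minibatch of 4 reaches an overall batch of 64. The sketch below works these out in plain Python. It is an illustration, not the authors' code: the `TrainConfig` class, its field names, and the `num_devices` parameter are hypothetical. The quoted text does not say whether the 4-to-64 gap is bridged by gradient accumulation, data parallelism, or both, and the schedule after warmup is elided in the quote, so `lr_at` simply holds the maximum rate there as a placeholder assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (names are hypothetical)."""

    total_steps: int                # 10_000 (CAMEO subset) or 20_000 (Mol-Instructions)
    max_lr: float = 1e-4            # maximum learning rate
    warmup_fraction: float = 0.05   # linear warmup over the first 5% of steps
    beta1: float = 0.9              # AdamW beta1
    beta2: float = 0.95             # AdamW beta2
    grad_clip: float = 1.0          # gradient clipping threshold
    minibatch_size: int = 4
    overall_batch_size: int = 64
    loss_alpha: float = 0.2         # loss weight alpha
    loss_beta: float = 0.2          # loss weight beta

    @property
    def warmup_steps(self) -> int:
        """Number of optimizer steps spent in linear warmup."""
        return round(self.total_steps * self.warmup_fraction)

    def accumulation_steps(self, num_devices: int = 1) -> int:
        """Minibatches accumulated per optimizer step.

        Assumption: the overall batch is minibatch_size * num_devices *
        accumulation_steps. The source does not state the device count.
        """
        per_pass = self.minibatch_size * num_devices
        if self.overall_batch_size % per_pass != 0:
            raise ValueError("overall batch size is not divisible by the per-pass batch")
        return self.overall_batch_size // per_pass

    def lr_at(self, step: int) -> float:
        """Learning rate at a 0-indexed optimizer step.

        Linear warmup up to max_lr. What happens after warmup is not given
        in the quoted text, so this placeholder keeps the rate at max_lr.
        """
        if self.warmup_steps == 0:
            return self.max_lr
        scale = min(1.0, (step + 1) / self.warmup_steps)
        return self.max_lr * scale


if __name__ == "__main__":
    for name, steps in [("CAMEO subset", 10_000), ("Mol-Instructions", 20_000)]:
        cfg = TrainConfig(total_steps=steps)
        print(name, "warmup steps:", cfg.warmup_steps,
              "accumulation steps on one device:", cfg.accumulation_steps())
```

Under these assumptions, warmup covers 500 steps on the CAMEO subset and 1,000 steps on Mol-Instructions, and a single device would need 16 accumulated minibatches per optimizer step. With more devices the accumulation count shrinks proportionally.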