Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Scaling Unlocks Broader Generation and Deeper Functional Understanding of Proteins

Authors: Aadyot Bhatnagar, Sarthak Jain, Joel Beazer, Samuel Curran, Alexander Hoffnagle, Kyle Ching, Michael Martyn, Stephen Nayfach, Jeffrey Ruffolo, Ali Madani

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate for the first time in the wet lab the influence of model scale on the sequences generated by PLMs, and we find that larger models generate viable proteins for a much wider diversity of protein families. Finally, we find both computationally and experimentally that larger models are more responsive to alignment with laboratory data, resulting in improved protein fitness prediction and sequence generation capabilities.
Researcher Affiliation	Industry	Profluent Bio, Inc. To whom correspondence should be addressed: EMAIL
Pseudocode	No	The paper describes the architecture and various training tasks and strategies (e.g., infilling training details, position-preserving fuzzy encoding, model alignment algorithms like IRPO) using narrative text and mathematical formulations. However, it does not include any clearly labeled pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code	Yes	Code and model weights are available at https://github.com/Profluent-AI/progen3.
Open Datasets	No	We release the model code on GitHub and weights (up to 3B parameters) on Hugging Face. However, we do not release the dataset, training code, and ProGen3-46B weights.
Dataset Splits	Yes	To measure model generalization, we construct validation sets distinct from our training data at 30%, 50%, and 90% ID. Each x% ID validation set consists of 2.5M sequences distributed uniformly between x% ID clusters of sizes 1-10, 11-100, and 101-1000. The average loss thus avoids overweighting highly represented parts of protein space and more accurately measures out-of-distribution generalization. We also average the losses on these sets to compute an aggregate validation loss.
Hardware Specification	Yes	We trained all models on H100s hosted by MosaicML/Databricks. Pre-training ProGen3-46B took approximately 17 days on a cluster of 256x H100. For models smaller than ProGen3-46B, we run alignment jobs on 8x H100; each job takes at most 3hr. For ProGen3-46B, we require 16x H100, and jobs take 2-6hr, depending on the size of the dataset.
Software Dependencies	No	We implement our models using PyTorch [72]. To improve efficiency, we use FlashAttention-2 [26] and Megablocks [37] for our attention and MoE layers, respectively. Finally, we orchestrate training and data loading with MosaicML’s composer [93] and streaming [94] libraries, respectively.
Experiment Setup	Yes	All models are trained using the AdamW optimizer [53, 62] with β1 = 0.9, β2 = 0.95, and BF16 mixed precision [66]. After an initial warmup period, we decay the learning rate to 10% of its peak value following a cosine schedule. We leverage fully sharded data parallel training [76] and gradient checkpointing [18] for memory-efficient distributed training. Table 3 describes our model configurations and pre-training hyperparameters in more detail. (Table 3 includes: Params, Layers, d_model, Attn Heads, d_FFN, LR, WD, BSZ, WU).
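
The learning-rate schedule quoted under Experiment Setup (an initial warmup, then cosine decay to 10% of the peak value) can be illustrated with a minimal sketch. This is not the authors' code: the function name `lr_at_step`, the linear shape of the warmup, and the example values for peak learning rate, warmup length, and total steps are all assumptions made for illustration. The actual per-model values are listed in Table 3 of the paper and are not reproduced here.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps, final_fraction=0.1):
    """Return the learning rate at a given optimizer step.

    Hypothetical helper: warmup followed by cosine decay down to
    final_fraction * peak_lr (10% of peak, as stated in the quoted setup).
    """
    if step < warmup_steps:
        # Assumption: linear warmup. The source only says "an initial warmup period".
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = final_fraction * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper.
    for s in (0, 99, 100, 550, 1000):
        print(s, lr_at_step(s, peak_lr=1e-3, warmup_steps=100, total_steps=1000))
```

With these illustrative values the rate rises to the peak by the end of warmup, then falls smoothly and settles at one tenth of the peak at the final step.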