Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Steering Generative Models with Experimental Data for Protein Fitness Optimization

Authors: Jason Yang, Wenda Chu, Daniel Khalil, Raul Astudillo, Bruce J. Wittmann, Frances Arnold, Yisong Yue

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages over alternatives such as reinforcement learning with protein language models. Overall, we provide practical insights into how to effectively steer modern generative models for next-generation protein fitness optimization.
Researcher Affiliation	Collaboration	Jason Yang, Chemistry & Chemical Engineering, California Institute of Technology; Wenda Chu, Computing & Mathematical Sciences, California Institute of Technology; Daniel Khalil, Computing & Mathematical Sciences, California Institute of Technology; Raul Astudillo, Computing & Mathematical Sciences, California Institute of Technology; Bruce J. Wittmann, Office of the Chief Scientific Officer, Microsoft Corporation; Frances H. Arnold, Chemistry & Chemical Engineering, Biology & Biological Engineering, California Institute of Technology; Yisong Yue, Computing & Mathematical Sciences, California Institute of Technology
Pseudocode	Yes	Pseudocode for our adaptive optimization algorithm is provided in Section A.5.
Open Source Code	Yes	To support future research and real-world adoption, our extensive, user-friendly code is available at https://github.com/jsunn-y/SGPO.
Open Datasets	Yes	We study three proteins, the TrpB enzyme (Johnston et al., 2024), the CreiLOV fluorescent protein (Chen et al., 2023c), and the GB1 binding protein (Olson et al., 2014) due to the availability of fitness data across many residues (Table 2).
Dataset Splits	Yes	For TrpB, residues outside of the design space of 15 residues were naively mapped to the original amino acid type in the parent sequence at the end of generation. ... Namely, we used all of the single, double, and triple mutants in the library for training, with 10% and 20% of the quadruple mutants being used for validation and testing, respectively. ... For GB1, the experimental fitnesses for nearly all double mutations across the entire protein were available, where fitness refers to binding affinity of a domain of the G protein. To train the oracle, we held out 10% and 20% of the sequences with two mutations as a validation and test set, respectively, with remaining sequences being used for training.
Hardware Specification	Yes	Pretraining/finetuning to obtain each initial prior was achieved on a single H100 GPU in less than one hour while each individual guidance experiment took minutes; pretraining language models took several hours on a single GPU.
Software Dependencies	No	No specific software versions are mentioned for libraries or programming languages used in the experimental setup in the main text or appendix.
Experiment Setup	Yes	Table 3: Summary of generative priors evaluated in this work. Each generative prior was trained on an MSA of homologous natural sequences. All denoising processes were modeled using a transformer architecture (Section A.3). Italicized models were further explored in downstream guidance experiments. Table A1: Summary of training details for generative priors in this work. Reference refers to the codebase that was modified for our implementation and where the model architecture was adapted from. For all models, we retained the model with the lowest validation loss. When using the ESM encoder, we used the 35M-parameter ESM2 model (Lin et al., 2023).
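
The split quoted in the Dataset Splits row (lower-order mutants used for training, with 10% and 20% of the highest-order mutants held out for validation and testing) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the fixed random seed, and the rounding of split sizes are assumptions made for illustration. The quoted text states only for GB1 that the remaining sequences go to training; applying the same rule to the quadruple-mutant library is also an assumption.

```python
import random


def count_mutations(seq, parent):
    """Number of positions where seq differs from the parent sequence."""
    return sum(a != b for a, b in zip(seq, parent))


def split_by_mutation_count(sequences, parent, held_out_order,
                            val_frac=0.10, test_frac=0.20, seed=0):
    """Hypothetical helper: hold out a fraction of the mutants that carry
    exactly `held_out_order` mutations; everything else is used for training.
    """
    train, candidates = [], []
    for seq in sequences:
        if count_mutations(seq, parent) == held_out_order:
            candidates.append(seq)
        else:
            train.append(seq)

    # Shuffle with a fixed seed so the split is repeatable (assumption).
    rng = random.Random(seed)
    rng.shuffle(candidates)

    n_val = round(val_frac * len(candidates))
    n_test = round(test_frac * len(candidates))
    val = candidates[:n_val]
    test = candidates[n_val:n_val + n_test]

    # Assumption: held-out-order mutants not selected for validation or
    # testing are returned to the training set.
    train.extend(candidates[n_val + n_test:])
    return train, val, test
```

Under these assumptions, `held_out_order=4` would correspond to the quadruple-mutant split and `held_out_order=2` to the GB1 double-mutant split described above.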