Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Generative property enhancer: implicit guided generation through conditional density estimation

Authors: Pedro O. Pinheiro, Pan Kessel, Aya Ismail, Sai Pooja Mahajan, Kyunghyun Cho, Saeed Saremi, Nataša Tagasovska

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate competitive empirical results on standard in silico offline (non-sequential) protein fitness optimization benchmarks. Finally, we propose iterative training on a combination of limited real data and self-generated synthetic data, enabling extrapolation beyond the original property ranges.
Researcher Affiliation	Collaboration	Pedro O. Pinheiro¹, Pan Kessel¹, Aya A. Ismail², Sai Pooja Mahajan¹, Kyunghyun Cho¹,³, Saeed Saremi¹, Nataša Tagasovska¹. ¹Prescient Design, Genentech; ²Guide Labs; ³New York University
Pseudocode	Yes	Algorithm 1 in appendix describes a simple pseudo-code for this procedure. Algorithm 1: Iterative Sampling. Algorithm 2: Iterative training on self-generated data.
Open Source Code	Yes	Source code provided to the submission and an updated version will be released if accepted.
Open Datasets	Yes	Datasets. We evaluate the performance of our model on two important protein subdomain datasets: adeno-associated virus (AAV) [70] and green fluorescent protein (GFP) [71].
Dataset Splits	Yes	We consider the two splits proposed by the authors medium and hard splits with 2,139/3,448 samples for AAV and 2,828/2,426 for GFP. The medium split is a subset of the data containing the 20th-40th percentiles that are 6 edit distances or more from any sample in the optimal fitness set. The hard split contains the lowest 30th percentiles that are 7 mutations or more away from the optimal fitness set.
Hardware Specification	Yes	The models on this paper were trained using single A100 Nvidia GPUs and 4 CPU workers per model.
Software Dependencies	No	The paper mentions using Adam optimizer and provides citations for third-party code implementations (Kirjner et al. [14], NVlabs/edm2, facebookresearch/flow_matching) but does not provide specific version numbers for Python, PyTorch, or other libraries used in their own implementation.
Experiment Setup	Yes	For the iterative sampling, we start with 128 seeds from the training set (similar to other baselines) and sample designs according to Algorithm 1 for K = 20 iterations. At each iteration, we sample a pool of M = 2560 designs and reject the repeated ones and those that have a Levenshtein distance larger than 10 from any seed. On the final iteration, we randomly pick 128 samples from the last pool of designs. m VAE. ... We have one layer of one-hot encoding, followed by 3-layer MLP resnet blocks with internal layers of size 128. Each m VAE was trained with Adam, learning rate 1e-4 for AAV and antibodies, and, 1e-5 for GFP, and train for 500 epochs each. m WJS. ... We chose a noise level σ=.5... The denoiser model has a total of 3.8M parameters, and it is trained with batch size of 256, learning rate 1e-3, Adam [78] optimizer and a total of 5,000/1,000 epochs for AAV and GFP, respectively. m FM. Similar to the m WJS variant, we train for 5,000/1,000 epochs for AAV and GFP, respectively, we use the learning rate of 1e-3, Adam optimizer and batch size 256.
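
The pool-filtering step quoted in the Experiment Setup row (reject repeated designs and those more than 10 edits from the seeds, then pick 128 at random from the last pool) can be sketched as below. This is an illustrative sketch only, not the authors' code. The function names `levenshtein`, `filter_pool`, and `pick_final` are hypothetical. Two readings of the quoted text are assumed: "repeated" is taken to mean duplicates within the same pool, and a design is kept when it lies within `max_dist` edits of at least one seed. The generative model that produces the pool is not shown.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def filter_pool(pool, seeds, max_dist=10):
    """Drop duplicate designs and designs too far from every seed.

    Assumption: a design survives if it is within max_dist edits of
    at least one seed; the source text does not spell this out.
    """
    seen = set()
    kept = []
    for design in pool:
        if design in seen:
            continue  # repeated design within this pool
        seen.add(design)
        if any(levenshtein(design, s) <= max_dist for s in seeds):
            kept.append(design)
    return kept


def pick_final(pool, n=128, rng=None):
    """Randomly pick up to n designs from the final pool, without replacement."""
    rng = rng or random.Random()
    pool = list(pool)
    return rng.sample(pool, min(n, len(pool)))
```

With the values quoted above, each of the K = 20 iterations would pass a pool of M = 2560 sampled sequences through `filter_pool(pool, seeds, max_dist=10)`, and the last pool through `pick_final(pool, n=128)`. The pairwise distance check is quadratic in sequence length and linear in the number of seeds, which is acceptable at this scale but could be replaced by an optimized edit-distance library.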