Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding

Authors: Thomas Walton, Darin Tsui, Aryan Musharaf, Amirali Aghazadeh

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we detail experiments across seven different proteins. We tested SpecMER with ProGen2 [4], an autoregressive protein language model. For each experiment, we conditioned generation on a fixed context length from a given wild-type protein as detailed in Table 1. We selected proteins with varying molecular functions and lengths to ensure robustness of testing. Each experiment consisted of generating 200 sequences on an NVIDIA RTX A6000 GPU.
Researcher Affiliation	Academia	Thomas A. Walton Georgia Institute of Technology EMAIL Darin Tsui Georgia Institute of Technology EMAIL Aryan Musharaf Georgia Institute of Technology EMAIL Amirali Aghazadeh Georgia Institute of Technology EMAIL
Pseudocode	Yes	Algorithm 1 Token-level maximal coupling [11]. Input: Distributions $p$, $q$; draft sample $X \sim p$. Compute the residual distribution $p_{\mathrm{res}}$ where, $\forall x \in V$, $p_{\mathrm{res}}(x) = \frac{q(x) - \min\{p(x), q(x)\}}{1 - \sum_{x' \in V} \min\{p(x'), q(x')\}}$. Sample $\eta \sim U(0, 1)$. if $\eta \le \min\left(1, \frac{q(X)}{p(X)}\right)$ then Return $Y = X$. {Accept the draft token.} else Return $Y \sim p_{\mathrm{res}}$. {Correct the token by sampling from the residual distribution.} end if
Open Source Code	Yes	Software for SpecMER is available at https://github.com/amirgroup-codes/SpecMER.git. Code for SpecMER is publicly available here: https://github.com/amirgroup-codes/SpecMER.git.
Open Datasets	Yes	We selected seven proteins with varying functions and collected their MSA from ProteinGym [20].
Dataset Splits	No	The paper mentions using a "context length to roughly 10% of the wild-type sequence" for conditional generation, but this is a parameter for the generation task rather than a traditional train/test/validation split of a dataset for model training or evaluation of the model itself. The models (ProGen2-S and ProGen2-M) are pre-trained, and the paper does not specify how the MSA data from ProteinGym [20] was split for any training or evaluation purposes.
Hardware Specification	Yes	For instance, generating 20,000 protein sequences of length 200 amino acids using ProGen2-XL, a 6.4-billion-parameter transformer-based autoregressive model, takes approximately 65 hours using a single NVIDIA A6000 GPU. Each experiment consisted of generating 200 sequences on an NVIDIA RTX A6000 GPU. No model training is performed in this work; inference is run on a server with eight NVIDIA A6000 GPUs.
Software Dependencies	No	The paper references "ProGen2" models (ProGen2-S, ProGen2-M, ProGen2-XL) as the draft and target models, but does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA, which would be necessary for a reproducible description of ancillary software.
Experiment Setup	Yes	We swept over the following hyperparameters: draft tokens γ ∈ {5, 10, 15}, temperatures T ∈ {0.7, 1, 1.4}, and k-mers k ∈ {(1), (3), (1, 3), (1, 3, 5)}. Sequences were sampled using nucleus (top-p) sampling, setting p = 0.95. The final hyperparameter set used to report results in Table 2 is listed in Table 6.
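
The token-level maximal coupling step quoted in the Pseudocode row can be illustrated with a minimal Python sketch. It is not taken from the SpecMER repository: the function name maximal_coupling, the list-based distributions, and the rng parameter are assumptions made for illustration only. Here p is the draft distribution, q is the target distribution, and x is a token index already sampled from p.

```python
import random


def maximal_coupling(p, q, x, rng=random):
    """One token-level maximal coupling step (illustrative sketch).

    p   : draft distribution over the vocabulary (list of probabilities)
    q   : target distribution over the same vocabulary
    x   : index of the draft token, assumed to have been sampled from p
    rng : the random module or a random.Random instance (hypothetical parameter)

    Returns a token index whose marginal distribution is q.
    """
    eta = rng.random()  # uniform on [0, 1)
    # Accept the draft token with probability min(1, q(x) / p(x)).
    # A strict "<" is used because eta can be exactly 0.0 but never 1.0.
    if eta < min(1.0, q[x] / p[x]):
        return x

    # Otherwise sample from the residual distribution
    # p_res(v) = (q(v) - min(p(v), q(v))) / (1 - sum_v' min(p(v'), q(v'))).
    overlap = [min(pv, qv) for pv, qv in zip(p, q)]
    norm = 1.0 - sum(overlap)
    residual = [(qv - ov) / norm for qv, ov in zip(q, overlap)]
    return rng.choices(range(len(q)), weights=residual, k=1)[0]


# Usage: draw a draft token from p, then couple it to the target q.
draft_probs = [0.7, 0.2, 0.1]
target_probs = [0.2, 0.5, 0.3]
draft_token = random.choices(range(3), weights=draft_probs, k=1)[0]
final_token = maximal_coupling(draft_probs, target_probs, draft_token)
```

Accepted tokens contribute min{p(v), q(v)} to the probability of each token v, and the residual branch contributes the remaining q(v) − min{p(v), q(v)}, so the returned token follows q regardless of how far the draft distribution is from the target.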