Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Understanding protein function with a multimodal retrieval-augmented foundation model

Authors: Timothy Truong Jr, Tristan Bepler

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To evaluate PoET-2's zero-shot and supervised variant effect prediction capabilities, we utilize the ProteinGym benchmark [5]. ProteinGym assesses the performance of a model by measuring its ability to predict variant effect in two types of datasets: (1) deep mutational scanning (DMS) datasets, which encompass over 200 distinct assays measuring the effect of mutations on a wide variety of proteins and protein functions spanning the tree of life, and (2) clinical datasets measuring the pathogenicity of mutations on >2,500 human genes. Following ProteinGym conventions, we use Spearman's rank correlation coefficient (ρ) between experimental measurements and predicted fitness as the primary metric for continuous variables, and area under the receiver operating curve (AUROC) for binary variables.
Researcher Affiliation	Industry	Timothy F. Truong Jr, OpenProtein.AI, NY, USA, EMAIL; Tristan Bepler, OpenProtein.AI, NY, USA, EMAIL
Pseudocode	Yes	Algorithm 1 embed_inputs embeds a single sequence or a sequence-of-sequences... Algorithm 2 encoder_layer... Algorithm 3 encoder... Algorithm 4 decoder_layer... Algorithm 5 decoder
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Code and model weights are planned for future public release.
Open Datasets	Yes	PoET-2 is trained on 62 million sets of homologous sequences. Each set corresponds to a sequence in UniRef50 Version 2304 [38], and contains all of its homologs in UniRef50 found using Diamond [39]. Each sequence may optionally be associated with a predicted structure from AFDB by matching on the UniRef100 identifier. ... We utilize the ProteinGym benchmark [5]. ... Appendix F: Licenses for existing assets: ProteinGym benchmark: MIT license; UniRef protein database: CC BY 4.0 license; AlphaFold database: CC BY 4.0 license
Dataset Splits	Yes	This benchmark assesses generalization ability across three cross-validation (CV) schemes, varying in difficulty based on the relationship between training and test set mutation locations. In the random fold, mutations are distributed randomly across five CV folds. In the modulo fold, protein positions are assigned to one of five CV folds using a modulo-based strategy i.e. every fifth position belongs to the same fold. In the contiguous fold, the protein sequence is divided into five contiguous, equal-length segments, each constituting a CV fold.
Hardware Specification	Yes	PoET-2 is trained for 3 million steps on 8 x A100 GPUs with 40GB VRAM each. ... Inference with PoET-2 is performed on g5.xlarge instances from Amazon Web Services. The instances are equipped with A10G Nvidia GPUs with 24GB VRAM. ... The computation of the SVD of embeddings from protein foundation models is performed on r6a.4xlarge instances from Amazon Web Services. These instances are equipped with 16 vCPUs and 128GB of RAM.
Software Dependencies	No	The paper mentions using specific tools and datasets like UniRef50 Version 2304, Diamond [39], AlphaFold DB, and the ColabFold MSA protocol [40]. It also refers to the Adafactor [44] optimizer. However, it does not specify general software environment components like Python version, PyTorch/TensorFlow version, or CUDA version, which are typically required for precise reproducibility.
Experiment Setup	Yes	PoET-2 is a 182 million parameter model, structured with 12 layers and a 1024 hidden dimension in its encoder and decoders. ... Context sequence tokens are masked with a random masking rate chosen uniformly from 0%-30%. Query sequence tokens are randomly masked with a random masking rate chosen uniformly from 0%-100%. Decoder sequence tokens are masked with a random masking rate chosen uniformly from 0%-30%. ... PoET-2 is trained with the same optimizer and learning rate schedule as PoET-1 [1]. Namely, the optimizer is Adafactor [44], and the learning rate schedule consists of a linear warmup over the first 4000 steps to a peak learning rate of 1e-2, and then a square root decay over the remaining training steps. ... A batch size of 45056 tokens is used per GPU with gradient accumulation over two steps, for an effective batch size of 90112 tokens per GPU.
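
Illustrative sketch (evaluation metrics): The Research Type row quotes two metrics, Spearman's rank correlation for continuous measurements and AUROC for binary labels. The standard-library Python below is a minimal sketch of both definitions, not the paper's or the benchmark's evaluation code. The function names `average_ranks`, `spearman_rho`, and `auroc` are hypothetical, and ties are assumed to receive averaged ranks.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic; labels are 0 or 1."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, label in zip(ranks, labels) if label == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

For example, `spearman_rho(measured, predicted)` would score one DMS assay, and `auroc(is_pathogenic, predicted)` one clinical dataset. Both functions assume non-constant inputs and, for `auroc`, at least one example of each class.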
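
Illustrative sketch (cross-validation schemes): The Dataset Splits row describes three ways of assigning mutations to five folds. The sketch below shows one plausible reading of that description for single-site mutations indexed by 0-based position. The function names are hypothetical, and the benchmark's own implementation may differ in details such as indexing, handling of multi-site mutants, and rounding of segment boundaries.

```python
import random


def random_folds(mutations, n_folds=5, seed=0):
    """Random scheme: each mutation is assigned to a fold at random."""
    rng = random.Random(seed)
    return [rng.randrange(n_folds) for _ in mutations]


def modulo_folds(positions, n_folds=5):
    """Modulo scheme: every n_folds-th position lands in the same fold."""
    return [p % n_folds for p in positions]


def contiguous_folds(positions, seq_len, n_folds=5):
    """Contiguous scheme: the sequence is cut into n_folds consecutive
    segments of (near-)equal length, one segment per fold."""
    return [min(p * n_folds // seq_len, n_folds - 1) for p in positions]
```

For a 10-residue protein, `modulo_folds(range(10))` interleaves positions across folds, while `contiguous_folds(range(10), seq_len=10)` keeps neighbouring positions together, which makes held-out positions less similar to the training positions.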
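
Illustrative sketch (learning rate schedule and batch size): The Experiment Setup row describes a linear warmup over 4000 steps to a peak of 1e-2 followed by a square root decay, and a per-GPU batch of 45056 tokens with two gradient accumulation steps. The sketch below assumes the common inverse-square-root form, in which the rate after warmup scales as sqrt(warmup_steps / step); the quoted text does not give the exact formula, so this form and the function names are assumptions.

```python
import math

PEAK_LR = 1e-2        # peak learning rate quoted in the table
WARMUP_STEPS = 4000   # warmup length quoted in the table


def learning_rate(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup, then an (assumed) inverse-square-root decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Continuous at step == warmup_steps, where the ratio equals 1.
    return peak_lr * math.sqrt(warmup_steps / step)


def effective_tokens_per_gpu(tokens_per_step, accumulation_steps):
    """Tokens contributing to one optimizer update on a single GPU."""
    return tokens_per_step * accumulation_steps
```

Under this assumed form the rate returns to half its peak at step 16000, since sqrt(4000 / 16000) = 0.5, and `effective_tokens_per_gpu(45056, 2)` reproduces the 90112 tokens per GPU stated in the table.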