Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Predicting mutational effects on protein binding from folding energy

Authors: Arthur Deng, Karsten D. Householder, Fang Wu, K. Christopher Garcia, Brian L. Trippe

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To evaluate StaB-ddG, we first analyze the contributions of different techniques that lead to an improvement in zero-shot ΔΔG_bind prediction accuracy, without training on ΔΔG_bind data. Next, we introduce baseline methods and show that StaB-ddG is the only DL approach to match FoldX and Flex ddG; an ensemble constructed by averaging FoldX and StaB-ddG provides state-of-the-art performance. Finally, we evaluate out-of-distribution accuracy of our approach on two additional binding strength datasets: one consisting of de novo designed small protein binders, and a second consisting of T cell receptor (TCR) mimic proteins we curate.
Researcher Affiliation	Academia	Arthur Deng¹, Karsten Householder¹, Fang Wu¹, K. Christopher Garcia¹, Brian Trippe¹. ¹Stanford University. Correspondence to: Arthur Deng <EMAIL>, Brian Trippe <EMAIL>.
Pseudocode	No	The paper describes the methodology using narrative text and mathematical equations, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code	Yes	Code: https://github.com/LDeng0205/StaB-ddG
Open Datasets	Yes	with experimental ΔΔG measurements for fewer than 350 distinct interfaces in the largest public curated dataset (Jankauskaitė et al., 2019).
Dataset Splits	Yes	We cluster the complexes using the original SKEMPIv2.0 clusters based on structural homology near the binding site, resulting in 64 disjoint clusters (Jankauskaitė et al., 2019). Then, we perform a random splitting to obtain 20 clusters with 1,491 mutants across 81 complexes as our test set. We report these clusters and split at https://github.com/LDeng0205/StaB-ddG/blob/main/data/SKEMPI/train_clusters.txt and https://github.com/LDeng0205/StaB-ddG/blob/main/data/SKEMPI/test_clusters.txt.
Hardware Specification	Yes	For StaB-ddG, by contrast, predictions on the same dataset took 13 NVIDIA-5090 GPU-minutes with batched computation (0.2 seconds per mutation). Model finetuning of StaB-ddG took 10 hours and 5 hours on the Megascale stability dataset and the SKEMPIv2.0 training split, respectively, on a single H100 GPU.
Software Dependencies	Yes	We use Rosetta version 3.8 with 35,000 backrub steps and average predictions across 10 models. For FoldX, initial repair steps are computed on the wild-type interface PDB followed by scoring of individual mutants. We use FoldX version 4.1.
Experiment Setup	Yes	In summary, we fine-tuned on the Megascale stability dataset using the ADAM optimizer with a learning rate of 3e-5 for 70 epochs with a batch size of 25,000 amino acids. We fine-tuned on SKEMPIv2.0 using the ADAM optimizer with learning rate 1e-6 for 200 epochs with a batch size of 25,000 amino acids.
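
The Dataset Splits row describes a cluster-level random split: whole homology clusters, rather than individual complexes or mutants, are assigned to the test set, so that no cluster contributes to both training and testing. The following is a minimal sketch of that idea only, not the authors' code. The function name `split_by_cluster`, the seed, the dictionary input format, and the toy complex identifiers are all assumptions made for illustration; the actual cluster assignments are the ones published in the repository files cited in the table.

```python
import random
from collections import defaultdict


def split_by_cluster(complex_to_cluster, n_test_clusters, seed=0):
    """Randomly hold out whole clusters so no cluster spans train and test.

    complex_to_cluster: dict mapping a complex ID to its cluster ID
    (hypothetical input format, chosen for this sketch).
    """
    # Group complexes by the cluster they belong to.
    clusters = defaultdict(list)
    for complex_id, cluster_id in complex_to_cluster.items():
        clusters[cluster_id].append(complex_id)

    # Sample cluster IDs (not complexes) for the test set.
    rng = random.Random(seed)
    test_clusters = set(rng.sample(sorted(clusters), n_test_clusters))

    train, test = [], []
    for cluster_id, members in clusters.items():
        target = test if cluster_id in test_clusters else train
        target.extend(members)
    return sorted(train), sorted(test), test_clusters


# Toy, made-up data: six complexes in four clusters.
toy = {"cplx_a": 0, "cplx_b": 0, "cplx_c": 1,
       "cplx_d": 2, "cplx_e": 2, "cplx_f": 3}
train, test, test_clusters = split_by_cluster(toy, n_test_clusters=2, seed=0)
print("train:", train)
print("test:", test)
```

With the figures quoted in the table, the corresponding call would hold out 20 of the 64 clusters. Splitting at the cluster level is what keeps structurally homologous interfaces from leaking between the training and test sets.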