Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Zero-shot protein stability prediction by inverse folding models: a free energy interpretation

Authors: Jes Frellsen, Maher Kassem, Tone Bengtsen, Lars Olsen, Kresten Lindorff-Larsen, Jesper Ferkinghoff-Borg, Wouter Boomsma

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To evaluate the effects of the different assumptions and approximations discussed in the previous section, we conclude the paper with a series of experiments in which the individual terms are estimated using available computational methods on a representative selection of protein datasets. In the following sections, we discuss these choices in turn. We will initially conduct our experiments using the pretrained ESM inverse folding (ESM-IF) model (Hsu et al., 2022), as it has been shown to perform well in a zero-shot setting (Notin et al., 2023). An ablation with ProteinMPNN is discussed subsequently. Our primary analysis considers three different data sets. The first is a high-quality data set measuring the thermodynamic stability of nearly all variants of a single 56-residue protein, the B1 domain of Protein G (hereafter called Protein G; Nisthal et al., 2019).
Researcher Affiliation	Collaboration	Jes Frellsen (Technical University of Denmark); Maher M. Kassem (Novonesis); Tone Bengtsen (Novonesis); Lars Olsen (Novonesis); Kresten Lindorff-Larsen (University of Copenhagen); Jesper Ferkinghoff-Borg (Novo Nordisk); Wouter Boomsma (University of Copenhagen)
Pseudocode	No	The paper describes methods through mathematical derivations and prose, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code for reproducing the experiments is accessible at: https://github.com/MachineLearningLifeScience/inverse_folding_free_energies
Open Datasets	Yes	The first is a high-quality data set measuring the thermodynamic stability of nearly all variants of a single 56-residue protein, the B1 domain of Protein G (hereafter called Protein G; Nisthal et al., 2019). The second is an older benchmark set, compiled for the FoldX prediction method by Guerois et al. (2002), mostly consisting of entries from the ProTherm database (Gromiha et al., 1999). Finally, we include data generated using variant abundance by massively parallel sequencing (VAMP-seq) experiments that probe stability only indirectly, by quantifying the variant abundance in cultured cells using a combination of fluorescent tags and sequencing (Matreyek et al., 2018). For our last experiment, we employ a subset of the mega-scale stability dataset (Tsuboyama et al., 2023) as included in ProteinGym (Notin et al., 2023).
Dataset Splits	No	The paper describes the datasets and their filtering, but does not explicitly provide information on training, validation, or test splits for any experiments conducted within the paper. For instance, for the Guerois dataset: "The Guerois data set (Guerois et al., 2002) contains 988 entries that we filtered to contain only single amino acid substitutions (i.e. no double and triple substitutions). 911 entries remained after filtering and are associated to 40 PDB structures."
Hardware Specification	No	The experiments in this paper comprise running Monte Carlo and molecular dynamics simulations for 40 proteins, in addition to model evaluation of pretrained models on all samples. Since no training was involved, no large scale GPU-resources were necessary for this study.
Software Dependencies	No	The paper mentions several tools and models like "ESM inverse folding (ESM-IF) model (Hsu et al., 2022)", "OpenMM framework (Eastman et al., 2017)", and "Phaistos framework (Boomsma et al., 2013)", but does not provide specific version numbers for the software dependencies themselves (e.g., Python, PyTorch, or specific versions of these frameworks).
Experiment Setup	Yes	Using the OpenMM framework (Eastman et al., 2017), 20 ns simulations were conducted at 300 K using 2 femtosecond time steps with the Langevin integrator, combined with the Amber 14 force field with a TIP3P water model, adding counter ions to assure overall neutrality. See appendix B.2 for details on the choice of simulation ensemble. ... We simulated segments with five flanking amino acids on each side of the position of interest, running for 10,000 iterations, where each 100th structure was saved.
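
The numbers quoted in the Experiment Setup row imply a few derived quantities that are easy to check by hand. The sketch below is not the authors' code and does not call OpenMM or Phaistos; the class names (MDSetup, MCSetup), the method names, and the 1-based, terminus-clipped segment window are assumptions introduced here purely for illustration. Only the default values (20 ns, 2 fs, 300 K, 10,000 iterations, every 100th structure, five flanking residues) come from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDSetup:
    """Molecular dynamics settings as quoted in the table (hypothetical container)."""
    duration_ns: float = 20.0
    timestep_fs: float = 2.0
    temperature_k: float = 300.0

    def n_steps(self) -> int:
        # 1 ns = 1,000,000 fs, so steps = total time in fs / time step in fs.
        return round(self.duration_ns * 1_000_000 / self.timestep_fs)


@dataclass(frozen=True)
class MCSetup:
    """Segment simulation settings as quoted in the table (hypothetical container)."""
    iterations: int = 10_000
    save_every: int = 100
    flank: int = 5

    def n_saved(self) -> int:
        # Saving every 100th structure out of 10,000 iterations.
        return self.iterations // self.save_every

    def segment(self, position: int, chain_length: int) -> range:
        # ASSUMPTION: residues are numbered from 1 and the window is clipped
        # at the chain termini; the source does not say how termini are handled.
        start = max(1, position - self.flank)
        end = min(chain_length, position + self.flank)
        return range(start, end + 1)


if __name__ == "__main__":
    md, mc = MDSetup(), MCSetup()
    print("MD integration steps:", md.n_steps())
    print("Saved structures per segment run:", mc.n_saved())
    print("Segment around residue 30 of a 56-residue chain:", list(mc.segment(30, 56)))
```

Under these assumptions, a 20 ns run at 2 fs corresponds to ten million integration steps, each segment run yields 100 saved structures, and an interior position gets an 11-residue window (the position plus five residues on each side).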