Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling

Authors: Michal Balcerak, Tamaz Amiranashvili, Antonio Terpin, Suprosanna Shit, Lea Bogensperger, Sebastian Kaltenbach, Petros Koumoutsakos, Bjoern Menze

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, our method significantly outperforms existing EBMs on both CIFAR-10 and ImageNet generation in terms of fidelity, and compares favorably to flow-matching and diffusion models without auxiliary generators or time-dependent EBM ensembles. Furthermore, we leverage the flexibility of the method to introduce an interaction energy that supports the exploration of diverse modes, which we demonstrate in a controlled protein generation setting. We evaluate our approach on CIFAR-10 [Krizhevsky and Hinton, 2009] and ImageNet 32x32 [Deng et al., 2009, Chrabaszcz et al., 2017] datasets, reporting FID scores in Table 1 and Table 2, respectively.
Researcher Affiliation	Academia	Michal Balcerak (University of Zurich); Tamaz Amiranashvili (University of Zurich, Technical University of Munich); Antonio Terpin (ETH Zurich); Suprosanna Shit (University of Zurich); Lea Bogensperger (University of Zurich); Sebastian Kaltenbach (Harvard University); Petros Koumoutsakos (Harvard University); Bjoern Menze (University of Zurich)
Pseudocode	Yes	Algorithm 1: Phase 1 (warm-up). Algorithm 2: Phase 2 (main training). Algorithm 3: Unconditional/conditional sampling with optional interaction energy.
Open Source Code	Yes	Code repository: https://github.com/m1balcerak/EnergyMatching. The NeurIPS checklist also states: "anonymized code bundle illustrating the full training + evaluation pipeline is attached to the submission; the repository and all ancillary assets needed for reproduction will be made openly available upon acceptance."
Open Datasets	Yes	We evaluate our approach on CIFAR-10 [Krizhevsky and Hinton, 2009] and ImageNet 32x32 [Deng et al., 2009, Chrabaszcz et al., 2017] datasets. We apply our method to a CelebA [Liu et al., 2015] 64×64 inpainting task. The LID estimates we obtain exhibit stronger correlations with PNG compression size (evaluated on 4096 images) using Spearman's correlation. Figure 4 offers qualitative illustrations. Our EBM-based approach compares favorably to diffusion-based methods, as it relies on fewer approximations by performing computations exactly on the data manifold rather than merely in its vicinity. ... Estimating the LID for MNIST [Deng, 2012] and τ = 2 for CIFAR-10. ... protein inverse design problem of generating Adeno-Associated Virus (AAV) capsid protein segments [Bryant et al., 2021]. The NeurIPS checklist provides licenses for all mentioned datasets, confirming their public availability.
Dataset Splits	Yes	We evaluate on two benchmark splits (medium and hard), which correspond to subsets of the original AAV dataset differing in baseline fitness distributions and required mutational distance from known high-performing variants [Kirjner et al., 2024]. The NeurIPS Paper Checklist states: "All key choices (data splits, hyper-parameters, optimizers, and selection criteria) are documented in the Training Details (Section D)."
Hardware Specification	Yes	We train for 145k iterations using Algorithm 1 ... on 4x A100. (CIFAR-10) ... We train for 640k iterations ... on 7x A100. (ImageNet 32x32) ... train for 250k iterations ... on 4x A100. (CelebA) ... train for 50k iterations ... on a single A100. (MNIST) ... train for 10k iterations ... on a single A100. (AAV) ... Table 4: Comparison of sampling efficiency and quality on CIFAR-10 (batch size 128, NVIDIA R6000 48GB GPU).
Software Dependencies	No	The gradient of the potential, ∇_x V(x), is computed using automatic differentiation via PyTorch's autograd [Ansel et al., 2024]. We optimize all models using the Adam optimizer [Kingma and Ba, 2014] and maintain an exponential moving average (EMA) of the model weights. The paper mentions software components but does not provide specific version numbers for them (e.g., PyTorch version, Python version, CUDA version).
Experiment Setup	Yes	CIFAR-10: ... Hyperparameters are: τ_s = 3.25, τ = 1.0, t = 0.01, M_Langevin = 200. We train for 145k iterations using Algorithm 1 with EMA 0.9999 and then 2k more with Algorithm 2 and EMA 0.99 on 4x A100. The batch size is 128, learning rate is 1.2 × 10^-3, ε_max = 0.01, λ_CD = 1 × 10^-3, α = 0.1, and β = 0.02. ... Similar detailed hyperparameters are provided for ImageNet 32x32, CelebA, MNIST, and AAV in Section D.
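
For anyone attempting a reproduction, the CIFAR-10 values quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is only an illustration: the class name and every field name are hypothetical and are not taken from the authors' repository. The learning rate and λ_CD exponents are read as negative powers of ten, and the field `t` simply mirrors the symbol as it appears in the row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cifar10Setup:
    """CIFAR-10 values quoted in the Experiment Setup row.

    Field names are hypothetical; only the numbers come from the row.
    """
    tau_s: float = 3.25                # tau_s
    tau: float = 1.0                   # tau
    t: float = 0.01                    # printed as "t = 0.01" in the row
    m_langevin: int = 200              # M_Langevin
    warmup_iterations: int = 145_000   # Algorithm 1 (warm-up phase)
    warmup_ema: float = 0.9999         # EMA used with Algorithm 1
    main_iterations: int = 2_000       # Algorithm 2 (main training phase)
    main_ema: float = 0.99             # EMA used with Algorithm 2
    num_gpus: int = 4                  # 4x A100
    batch_size: int = 128
    learning_rate: float = 1.2e-3      # exponent assumed negative
    eps_max: float = 0.01              # epsilon_max
    lambda_cd: float = 1e-3            # lambda_CD, exponent assumed negative
    alpha: float = 0.1
    beta: float = 0.02

    @property
    def total_iterations(self) -> int:
        """Sum of the two training phases reported for CIFAR-10."""
        return self.warmup_iterations + self.main_iterations


cifar10 = Cifar10Setup()
print(cifar10.total_iterations)
```

Freezing the dataclass keeps the recorded values from being changed by accident during a run; the settings for the other datasets are in Section D of the paper and are not reproduced here.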