Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models

Authors: Michael Plainer, Hao Wu, Leon Klein, Stephan Günnemann, Frank Noé

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we evaluate the consistency of diffusion models by comparing samples obtained through classical denoising (iid) with those generated by sequential Langevin simulations (sim) using forces derived with the score force relation from Equation (6). In other words, we verify the alignment of Equation (7) for energy-based models and show superior performance. We demonstrate our approach (Score MD) on three biomolecular systems (alanine dipeptide, Chignolin, and BBA) and introduce a transferable model that generalizes across dipeptides, improving over existing state-of-the-art Boltzmann generators (Klein and Noé, 2024). The code, model weights, and self-contained notebooks in JAX and PyTorch are publicly available at https://github.com/noegroup/ScoreMD. Metrics. As the main metric of interest, we compare the 2D free energy surfaces of the equilibrium distributions obtained by different methods. For dipeptides, we project the data onto the dihedral angles φ and ψ, while for proteins we perform time-lagged independent component analysis (TICA) (Pérez-Hernández et al., 2013) on bond distances and dihedral angles to recover two representative coordinates. To quantify differences between free energy surfaces, we report the potential of mean force (PMF) error (Durumeric et al., 2024), which measures the squared distance between the negative logarithms of the sampled and reference densities in the projected space. This metric places higher weight on low-density regions compared to alternatives such as the Jensen-Shannon (JS) divergence. Additional details are provided in Appendix B.2.
Researcher Affiliation	Collaboration	1 Freie Universität Berlin; 2 Zuse School ELIZA; 3 Technische Universität Berlin; 4 Berlin Institute for the Foundations of Learning and Data; 5 Shanghai Jiao Tong University; 6 Technische Universität München; 7 Rice University; 8 Microsoft Research AI4Science
Pseudocode	No	The paper describes methods and equations, but does not present any structured pseudocode or algorithm blocks. The derivations in Appendix A are mathematical rather than algorithmic.
Open Source Code	Yes	Our code, model weights, and self-contained JAX and PyTorch notebooks are available at https://github.com/noegroup/ScoreMD.
Open Datasets	Yes	For alanine dipeptide, we use 50k samples from an MD simulation in implicit solvent (Köhler et al., 2021), coarse-grained to five atoms: [C, N, CA, C, N]. For Chignolin and BBA, we use the dataset from Lindorff-Larsen et al. (2011), coarse-grained to one bead per amino acid (10 and 28 residues, respectively), and use 80% of the samples for training. For the iid setting, we generate the same number of samples as in the training set. ... The alanine dipeptide dataset is available as part of the public bgmol (MIT licence) repository here: https://github.com/noegroup/bgmol. ... The original dipeptide dataset (2AA) was introduced in Klein et al. (2023a) (MIT License) and is available here: https://huggingface.co/datasets/microsoft/timewarp. As this includes a lot of intermediate saved states and quantities, like energies, there is a smaller version made available by Klein and Noé (2024) (CC BY 4.0): https://osf.io/n8vz3/?view_only=1052300a21bd43c08f700016728aa96e.
Dataset Splits	Yes	For Chignolin and BBA, ... and use 80% of the samples for training. ... It is split into 175 train, 75 validation, and 92 test dipeptides, out of which we have used 15 for the results presented in the paper (also the metrics) to reduce inference time.
Hardware Specification	Yes	We have used a single RTX 3090 GPU for the toy systems, an A100 with 80GB memory for alanine dipeptide, two A100 80GB GPUs for the dipeptide dataset and Chignolin, and four for BBA. Depending on availability, we have also used H100 for some experiments.
Software Dependencies	Yes	We perform NVT dynamics with the Langevin integrator as implemented in OpenMM (Eastman et al., 2017) version 8.2.0. ... In our code, we have used jax (Bradbury et al., 2018) (Apache-2.0) and the accompanying machine learning library flax (Heek et al., 2024) (Apache-2.0). ... For the graph transformer architecture, we have extended code from Arts et al. (2023) (MIT) and have re-implemented the code from https://github.com/lucidrains/graph-transformer-pytorch (MIT) in jax. For the free-energy plots of the Müller-Brown potential, we used Hoffmann et al. (2021) (LGPL-3.0). For trajectories and simulations, we have used OpenMM (Eastman et al., 2017) (MIT) and mdtraj (McGibbon et al., 2015) (LGPL-2.1).
Experiment Setup	Yes	Architecture. For alanine dipeptide we have used quite a small architecture, where the hyperparameters are listed in Table 3. When multiple parameters are listed for the same model, this means that they are used for the corresponding MoE model. Note that when using MoE, we have mostly used the same model architecture, except that only the Fokker Planck regularized model is conservative. As for the optimizer, we have used AdamW (Loshchilov and Hutter, 2019). ... The hyperparameters are listed in Table 6. When multiple parameters are listed for the same model, this means that they are used for the corresponding MoE model.
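
The Research Type entry describes the PMF error as the squared distance between the negative logarithms of the sampled and reference densities in a 2D projection. The following is a minimal sketch of that idea, not the paper's implementation: the function name pmf_error, the histogram binning, the density floor eps, and the averaging over bins are all assumptions made for illustration. The exact definition in Durumeric et al. (2024) and in Appendix B.2 of the paper may differ.

```python
import numpy as np


def pmf_error(samples, reference, bins=32,
              value_range=((-np.pi, np.pi), (-np.pi, np.pi)), eps=1e-6):
    """Illustrative PMF error between two sets of 2D projected samples.

    samples, reference: arrays of shape (n, 2), e.g. dihedral angles (phi, psi).
    bins, value_range, eps: hypothetical choices, not taken from the paper.
    """
    # Estimate both densities on the same 2D grid.
    h_s, _, _ = np.histogram2d(samples[:, 0], samples[:, 1],
                               bins=bins, range=value_range, density=True)
    h_r, _, _ = np.histogram2d(reference[:, 0], reference[:, 1],
                               bins=bins, range=value_range, density=True)

    # Negative log densities (free energies up to a constant, in units of kT).
    # The floor eps avoids log(0) in empty bins.
    f_s = -np.log(np.maximum(h_s, eps))
    f_r = -np.log(np.maximum(h_r, eps))

    # Mean squared difference over all grid cells.
    return float(np.mean((f_s - f_r) ** 2))
```

Because the comparison is made in log space, bins where one density is small and the other is not contribute large differences, which is consistent with the statement that the metric weights low-density regions more heavily than the JS divergence does.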