Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Progressive Inference-Time Annealing of Diffusion Models for Sampling from Boltzmann Densities

Authors: Tara Akhound-Sadegh, Jungyoon Lee, Joey Bose, Valentin De Bortoli, Arnaud Doucet, Michael Bronstein, Dominique Beaini, Siamak Ravanbakhsh, Kirill Neklyudov, Alexander Tong

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, PITA enables, for the first time, equilibrium sampling of N-body particle systems, Alanine Dipeptide, and Tripeptide in Cartesian coordinates with dramatically fewer energy function evaluations. Code available at: https://github.com/taraak/pita. (...) We test the empirical performance of PITA on standard N-body particle systems and short peptides in Alanine Dipeptide and tripeptides.
Researcher Affiliation	Collaboration	1McGill University, 2Mila Quebec AI Institute, 3Université de Montréal, 4University of Oxford, 5Google DeepMind, 6AITHYRA, 7Valence Labs, 8Institut Courtois
Pseudocode	Yes	Algorithm 1 Training for single temperature 1/β_{i+1}
Open Source Code	Yes	Code available at: https://github.com/taraak/pita.
Open Datasets	Yes	We evaluate PITA on molecular conformation sampling tasks including a toy Lennard-Jones system of 13 particles (LJ-13) and Alanine peptide systems of varying sizes (Alanine Dipeptide and Tripeptide) in Cartesian coordinate space. (...) All of our experiments are on public datasets and energies.
Dataset Splits	Yes	For MD data on ALDP and AL3, we run two chains, one for training and one for test. We use the same simulation parameters for both. For training data we sample shorter chains more frequently (every 100 MD steps). To conserve disk space for long test chains, we save every 10k steps. Further parameters can be found in Table 11 and Table 12.
Hardware Specification	Yes	We run our experiments on H100 GPUs and the exact run time, energy function query cost, and overall run time are outlined in the experimental setup.
Software Dependencies	No	The paper discusses various models and methods like EGNN (Satorras et al., 2021) and DiT (Peebles and Xie, 2023) but does not provide specific version numbers for software libraries or environments used for implementation, which is required for a 'Yes' answer.
Experiment Setup	Yes	For LJ-13, we use equal loss weights for energy pinning, denoising score matching, and EBM distillation. We use the noise schedule of Karras et al. (2022), with the following parameters: σmin = 0.05, σmax = 80 and ρ = 7. The model uses EGNN (Satorras et al., 2021) with approximately 90k parameters, consisting of three layers and a hidden dimension of 32. For ALDP and AL3, the energy pinning, denoising score matching, and EBM distillation components of the loss are weighted equally at 1.0, with an additional target score matching loss weighted at 0.01. We use the same noise schedule as the LJ-13 experiment, using a smaller σmin of 0.01. We use DiT (Peebles and Xie, 2023) comprising six layers and six attention heads, with a hidden size of 192 and a total of roughly 12 million parameters. All models are trained with a learning rate of 1 × 10⁻³ without any weight decay. For ALDP and AL3, we use Exponential Moving Average (EMA) with a decay rate of 0.999, updating every gradient step.
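
For readers who want to see what the quoted hyperparameters mean in practice, the following is a minimal sketch of the ρ-spaced noise-level discretization from Karras et al. (2022) and a plain EMA parameter update, using the values reported in the Experiment Setup row (σmin = 0.05, σmax = 80, ρ = 7, EMA decay 0.999). It is an illustration only, not code from the PITA repository: the function names `karras_sigmas` and `ema_update`, the number of noise levels `n_steps`, and the choice to represent parameters as a flat list of floats are all assumptions, and the excerpt does not say how PITA applies the schedule during training versus sampling.

```python
def karras_sigmas(n_steps, sigma_min=0.05, sigma_max=80.0, rho=7.0):
    """Noise levels interpolated linearly in sigma**(1/rho) space.

    Returns n_steps values running from sigma_max down to sigma_min.
    n_steps is a hypothetical choice; the quoted setup does not state it.
    """
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    inv_rho = 1.0 / rho
    hi = sigma_max ** inv_rho
    lo = sigma_min ** inv_rho
    return [
        (hi + i / (n_steps - 1) * (lo - hi)) ** rho
        for i in range(n_steps)
    ]


def ema_update(ema_params, params, decay=0.999):
    """One exponential-moving-average step over a flat list of parameters.

    Intended to be called once per gradient step, matching the quoted setup.
    """
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


if __name__ == "__main__":
    # LJ-13 values quoted above; for ALDP and AL3 the setup uses sigma_min=0.01.
    for sigma in karras_sigmas(5):
        print(f"{sigma:.4f}")
```

With ρ = 7 the levels are concentrated near σmin, so most of the discretization budget goes to the low-noise regime; passing `sigma_min=0.01` gives the ALDP and AL3 variant described above.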