Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Learning Latent Variable Models via Jarzynski-adjusted Langevin Algorithm

Authors: James Cuin, Davide Carbone, O. Deniz Akyildiz

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate the performance of JALA-EM on a variety of latent variable models and show that it performs comparably to existing methods in terms of accuracy and computational efficiency. Importantly, the ability to recursively estimate marginal likelihoods, an uncommon feature among scalable methods, makes our approach particularly suited for model selection, which we validate through dedicated experiments. ... 5 Experimental results
Researcher Affiliation	Academia	James Cuin, Department of Mathematics, Imperial College London, London, UK, EMAIL; Davide Carbone, Laboratoire de Physique de l'Ecole Normale Supérieure, Université PSL, CNRS, Sorbonne Université, Université de Paris, Paris, France, EMAIL; O. Deniz Akyildiz, Department of Mathematics, Imperial College London, London, UK, EMAIL
Pseudocode	Yes	Algorithm 1 JALA-EM
Open Source Code	Yes	The code can be found in https://github.com/jamescuin/jala-em.
Open Datasets	Yes	To benchmark JALA-EM's performance, relative to that of the particle gradient descent (PGD) and stochastic optimization via unadjusted Langevin algorithm (SOUL) algorithms, as introduced in Kuntz et al. (2023) and De Bortoli et al. (2021) respectively, we consider the extensively studied Bayesian logistic regression task using the Wisconsin Breast Cancer dataset, as described in De Bortoli et al. (2021). ... To begin, we focus on the setup of the binary classification problem, since the multi-class setting is a natural extension. To be clear, using the MNIST handwritten digit dataset, the task is to distinguish between images of the digits 4 and 9
Dataset Splits	Yes	Lastly, note that the pre-processed dataset, D = (X, y), was then split into a training-validation set, D_{train,val}, and a hold-out test set, D_test, via an 80-20 stratified split, so that class proportions are maintained across said split. ... we perform K-fold cross-validation on D_{train,val} to select the step-sizes. Specifically, this tuning is conducted over a predefined grid for the particle update step-sizes and, where applicable (i.e. for SOUL and JALA-EM), for the θ-update step-sizes. The evaluation metric utilised for this hyperparameter tuning, within the cross-validation folds, is the Log Pointwise Predictive Density (LPPD), which assesses the model's average predictive accuracy on unseen data. ... For the Bayesian logistic regression model, this bound takes the form ... and to this end we compute the worst-case upper bound on the Hessian, H_bound, to obtain a single, globally relevant Lipschitz constant, L, for determining h_euler. Indeed, this is a constant matrix that is larger, in the positive semi-definite sense, than any Hessian encountered, and thus its largest eigenvalue provides an upper bound on L valid across the entire parameter space. In fact, for the Bayesian logistic regression model, this bound takes the form
Hardware Specification	Yes	Experiments were run on a personal computer and a Google Colab T4 GPU.
Software Dependencies	No	In contrast to the Bayesian logistic regression experiment (see Appendix C.2), this experiment utilises a global, fixed step-size of 0.1, for all algorithms, rather than tuning them via K-fold cross-validation, as this is an example where computational (or expertise) limitations prohibit comprehensive fine-tuning. As such, all algorithms, including JALA-EM, are implemented in JAX.
Experiment Setup	Yes	To be clear, in the case of SOUL, this refers to the number of outer steps, whereas the number of inner steps is determined by N. Also, note that for JALA-EM, we choose C = 1/1.05 and utilise systematic resampling in cases in which this threshold is breached, as recommended in Carbone et al. (2023). ... Regarding the configuration of JALA-EM, we initialise model parameters perturbed from their true values, so that θ_{M,0} = (log σ^2 + 1, log α + 1, ...) = (1, 1, ...), where we set log ν_0 = log ν + 1 = log(4.0) + 1 when G = M_T, and log ν_0 = log(5.0) when G = M_G, corresponding to the upper limit of our constraint for ν. The algorithm is run for K = 250 iterations, using N = 50 particles, with a Langevin dynamic step-size of h = 5e-5, while the parameter optimisation learning rate is η = 5e-3, where OPT is in fact Adam (Kingma and Ba, 2015), with β_1 = 0.9, to demonstrate optimisers other than SGD can be leveraged.
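
The 80-20 stratified split quoted in the Dataset Splits row can be illustrated with a minimal standard-library Python sketch. This is not taken from the authors' repository: the function name `stratified_split`, the seed, and the toy labels are hypothetical and chosen only to show how class proportions are preserved when a hold-out test set is carved off.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_val_idx, test_idx) so that each class keeps its proportion.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_val_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out the same fraction of every class.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_val_idx.extend(indices[n_test:])
    return sorted(train_val_idx), sorted(test_idx)


# Toy labels (an assumption for the example): 60 negatives, 40 positives.
labels = [0] * 60 + [1] * 40
train_val_idx, test_idx = stratified_split(labels)
print(len(train_val_idx), len(test_idx))  # prints: 80 20
```

With these toy labels the test set holds 12 negatives and 8 positives, matching the 60/40 class ratio of the full set. K-fold cross-validation for step-size selection would then be run on the training-validation indices only, leaving the test indices untouched.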