Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Exponential-Family Harmoniums with Neural Sufficient Statistics

Authors: Azwar Abdulsalam, Joseph G. Makin

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variables

Research Type: Experimental
LLM Response: On these datasets, the NN-EFH achieves FID scores that are 25-50% lower than a standard energy-based model with a similar neural-network architecture and the same number of parameters; and competitive with noise-conditional score networks, which utilize more complex neural networks (U-Nets) and require considerably more sampling steps. ... Quality of Generated Images. We begin with MNIST (32x32). Fig. 3a shows digits generated from an NN-EFH. ... To quantify the quality of the celebrity faces generated by the NN-EFH, we compare their FID score (Heusel et al. 2017) against similarly sized recent models (Table 1). ... Table 2 shows that the NN-EFH achieves significantly better scores than both of the other models.

Researcher Affiliation: Academia
LLM Response: Elmore Family School of Electrical and Computer Engineering, Purdue University, 465 Northwestern Ave., West Lafayette, IN 47907, EMAIL, EMAIL

Pseudocode: Yes
LLM Response: Algorithm 1: Gibbs-Langevin Training ... Algorithm 2: Gibbs-Langevin Testing

Open Source Code: No
LLM Response: The paper states: "For training GRBMs in this work, we use their source code and choices for hyperparameters." This refers to a third-party codebase (Liao et al. 2022); the paper does not provide access to source code for its own method (the NN-EFH).

Open Datasets: Yes
LLM Response: With Gibbs-Langevin, the GRBM can successfully model small data sets like MNIST and CelebA-32, but struggles with CIFAR-10, and cannot scale to larger images because it lacks convolutions. In contrast, our neural-network EFHs (NN-EFHs) generate high-quality samples from CIFAR-10 and scale well to CelebA-HQ. ... Quality of Generated Images. We begin with MNIST (32x32). ... Next we turn to CelebA-HQ (64x64). ... Finally, we train all three models on CIFAR-10 (60,000 32x32 color images spread across 10 classes, each representing different objects such as animals and vehicles).

Dataset Splits: No
LLM Response: The paper mentions several datasets (e.g., MNIST, CIFAR-10, CelebA-HQ) and refers to training and testing phases, but it does not state specific split percentages or sample counts, nor does it cite any predefined standard splits for training, validation, and testing.

Hardware Specification: Yes
LLM Response: All models are trained using stochastic gradient descent with the Adam optimizer on V100 GPUs for 50,000 iterations with a batch size of 64.

Software Dependencies: No
LLM Response: The paper mentions using "their source code and choices for hyperparameters" for GRBMs (Liao et al. 2022), but does not name any software with version numbers for its own implementation or other ancillary software.

Experiment Setup: Yes
LLM Response: In particular, we employ L = 60 steps of Langevin dynamics within M = 5 steps of Gibbs sampling. We train the baseline EBM using the standard MLE/min-KL loss. In our experience, the best results for EBMs trained with Langevin dynamics on complex datasets are achieved with a step size of ϵ = 1 and temperature T = 5e-5, and we accordingly used these parameters in our Langevin-with-Gibbs when training the NN-EFH and for the (vanilla) Langevin dynamics for the EBM. ... All models are trained using stochastic gradient descent with the Adam optimizer on V100 GPUs for 50,000 iterations with a batch size of 64.
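To make the quoted sampling setup concrete, here is a minimal sketch of Langevin-within-Gibbs sampling with the reported hyperparameters (L = 60 Langevin steps inside each of M = 5 Gibbs steps, step size eps = 1, temperature T = 5e-5). The functions `grad_energy` and `sample_hidden` are hypothetical stand-ins for the model's energy gradient and hidden-unit resampling; this is not the paper's implementation, only an illustration of the loop structure.

```python
import numpy as np

def langevin_within_gibbs(grad_energy, sample_hidden, x0, rng,
                          M=5, L=60, eps=1.0, T=5e-5):
    """Run M Gibbs sweeps, each refining the visibles with L Langevin steps.

    grad_energy(x, h): gradient of the energy w.r.t. the visibles (stand-in).
    sample_hidden(x, rng): resample hidden units given visibles (stand-in).
    """
    x = x0
    for _ in range(M):
        h = sample_hidden(x, rng)              # Gibbs step: resample hiddens
        for _ in range(L):                     # Langevin refinement of visibles
            noise = rng.standard_normal(x.shape)
            x = x - (eps / 2) * grad_energy(x, h) + np.sqrt(eps * T) * noise
    return x
```

With the low temperature quoted in the paper (T = 5e-5), the injected noise term is very small relative to the gradient step, so the dynamics behave close to gradient descent on the energy.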