Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Neurosymbolic Diffusion Models

Authors: Emile van Krieken, Pasquale Minervini, Edoardo Maria Ponti, Antonio Vergari

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across both synthetic and real-world benchmarks, including high-dimensional visual path planning and rule-based autonomous driving, NESYDMS achieve state-of-the-art accuracy among NeSy predictors and demonstrate strong calibration. ... 4 Experiments: We aim to answer the following research questions: (RQ1:) Can NESYDMS scale to high-dimensional reasoning problems? and (RQ2:) Does the expressiveness of NESYDMS improve reasoning shortcut awareness compared to independent models? ... In Table 3, we find that NESYDM strikes a good balance between accuracy and RS-awareness throughout the datasets.
Researcher Affiliation	Collaboration	Emile van Krieken¹, Pasquale Minervini¹,², Edoardo Ponti¹, Antonio Vergari¹; ¹School of Informatics, University of Edinburgh; ²Miniml.AI; EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1: Algorithm for estimating the gradients of the NELBO for training NESYDM. Algorithm 2: Standard time-discretised output prediction for NESYDM.
Open Source Code	Yes	Code is available at https://github.com/HEmile/neurosymbolic-diffusion.
Open Datasets	Yes	Across both synthetic and real-world benchmarks, including high-dimensional visual path planning and rule-based autonomous driving, NESYDMS achieve state-of-the-art accuracy ... RSBench suite of visual reasoning problems [11] ... Multidigit MNIST Addition. The input x is a sequence of 2 numbers of N digits ... BDD-OIA (BDD) is a self-driving task [87] where a model predicts what actions a car can take given a dashcam image.
Dataset Splits	Yes	For the random search, we used fixed ranges for each parameter, from which we sample log-uniformly. For the parameter β we sampled uniformly instead. We used a budget of 30 random samples for each problem, although for some problems we needed more when we found the ranges chosen were poor. ... For the MNIST Addition, we split the training dataset in a training dataset of 50.000 samples and a validation dataset of 10.000 samples before creating the addition dataset. We tune with this split, then again train 10 times with the optimised parameters on the full training dataset of 60.000 samples for the reported test accuracy. ... We performed hyperparameter tuning on the validation set of the 12 x 12 grid size problem, then reused the same hyperparameters for the 30 x 30 grid size problem.
Hardware Specification	Yes	For all experiments, we used GPU computing nodes, each with a single lower-end GPU. In particular, we used NVIDIA GeForce GTX 1080 Ti and GTX 2080 Ti GPUs. All our experiments were run with 12 CPU cores, although this was not the bottleneck in most experiments.
Software Dependencies	No	NESYDM is implemented in PyTorch. We used RAdam [45] for all experiments except for MNIST Addition, where we used Adam [35].
Experiment Setup	Yes	G.1 Hyperparameter tuning: We list all hyperparameters in Table 4. We perform random search over the hyperparameters on the validation set of the benchmark tasks. For the random search, we used fixed ranges for each parameter, from which we sample log-uniformly. For the parameter β we sampled uniformly instead. We used a budget of 30 random samples for each problem, although for some problems we needed more when we found the ranges chosen were poor. ... Table 5: Hyperparameters for MNIST Addition and Warcraft Path Planning.
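
Illustrative sketch (not from the paper or its repository): the random search quoted under Experiment Setup samples most hyperparameters log-uniformly from fixed ranges, samples β uniformly, and uses a budget of 30 samples per problem. The Python below shows one way such a search could look. The function names, the parameter names `learning_rate` and `weight_decay`, all numeric ranges, and the toy `evaluate` objective are hypothetical placeholders; the quoted text does not give the actual ranges or the validation metric.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng, log_ranges, beta_range):
    """Sample one configuration: log-uniform for every entry in log_ranges,
    plain uniform for beta, as the quoted setup describes."""
    config = {
        name: sample_log_uniform(rng, low, high)
        for name, (low, high) in log_ranges.items()
    }
    config["beta"] = rng.uniform(*beta_range)
    return config


def random_search(evaluate, log_ranges, beta_range, budget=30, seed=0):
    """Evaluate `budget` random configurations and return the best one.
    `evaluate` stands in for training plus validation scoring (higher is better)."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(budget):
        config = sample_config(rng, log_ranges, beta_range)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical ranges, chosen only to make the sketch runnable.
LOG_RANGES = {"learning_rate": (1e-5, 1e-2), "weight_decay": (1e-6, 1e-3)}
BETA_RANGE = (0.0, 1.0)


def toy_evaluate(config):
    """Toy stand-in for a validation score; peaks near learning_rate = 1e-3."""
    return -abs(math.log10(config["learning_rate"]) + 3.0)


best_config, best_score = random_search(toy_evaluate, LOG_RANGES, BETA_RANGE)
```

Log-uniform sampling spreads draws evenly across orders of magnitude, which suits scale-like parameters such as learning rates. A bounded parameter like β is sampled uniformly, as in the quoted setup.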