Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Robustness Distributions in Neural Network Verification
Authors: Annelot Bosman, Aaron Berger, Holger H. Hoos, Jan N. van Rijn
JAIR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then analyse the distributions of these critical 𝜀 values over a given set of inputs for 12 MNIST classifiers widely used in the literature on neural network verification. Using a Kolmogorov-Smirnov test, we obtain support for the hypothesis that the critical 𝜀 values of 11 of these networks follow a log-normal distribution. Furthermore, we found no statistically significant differences between the critical 𝜀 distributions for training and testing data for 12 feed-forward neural networks on the MNIST dataset. |
| Researcher Affiliation | Academia | ANNELOT W. BOSMAN, Leiden University, The Netherlands; AARON BERGER, RWTH Aachen University, Germany; HOLGER H. HOOS, RWTH Aachen University, Germany and Leiden University, The Netherlands; JAN N. VAN RIJN, Leiden University, The Netherlands |
| Pseudocode | No | The paper describes the 𝑘-binary search algorithm in Section 3.3 "𝑘-binary Search" using natural language and conceptual steps, but does not present a formal, structured pseudocode block or algorithm listing. |
| Open Source Code | Yes | Lastly, we provide a ready-to-use Python package available on GitHub that can be used for creating robustness distributions and enables others to build upon our work [1]. The package is modular, such that any part can be changed, including the instance set under consideration, the robustness property or the verifiers used. This makes our results fully reproducible and will help others build on our work. Furthermore, all our networks and data are available on GitHub [2]. [1] See: https://github.com/ADA-research/VERONA [2] See: https://github.com/ADA-research/NNV_JAIR_robustness_distributions |
| Open Datasets | Yes | We analyse the critical 𝜀 distributions for 12 widely studied fully-connected MNIST neural networks... We investigate the effect adversarial training can have on the critical 𝜀 distribution of various neural networks for MNIST, CIFAR and GTSRB datasets. |
| Dataset Splits | Yes | Following the work of König et al. [25], we used the first 100 instances from the MNIST training and testing sets, respectively... For both CIFAR-10 and GTSRB, we randomly selected 100 testing and training images each, with random seed 42. Given that the GTSRB dataset contains 42 classes, we performed stratified random selection. |
| Hardware Specification | Yes | All experiments were carried out on a cluster of machines, each equipped with 2 Intel Xeon E5-2683 CPUs with 32 cores, 40MB cache size and 94GB of RAM. |
| Software Dependencies | Yes | We used Python 3.10 with CentOS 7.0. |
| Experiment Setup | Yes | We ran 𝑘-binary search with 200 𝜀 values, ranging from 0.001 to 0.4, in intervals of 0.002, i.e., (0.001, 0.003, . . . , 0.397, 0.399)... The time-out for each of these queries was set to one hour. For MNIST, we used a perturbation of 0.2 and for CIFAR-10 and GTSRB 8/255. For PGD training... For MNIST, we used a perturbation of 0.3, and for CIFAR-10 and GTSRB, we used a perturbation of 8/255... For all training methods, we performed hyperparameter optimisation using Optuna [1]; the final hyperparameter values can be found in Appendix I, Tables 21 and 22. |
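The log-normal claim quoted under Research Type can be checked with a one-sample Kolmogorov-Smirnov test against a fitted log-normal, as the paper describes. The sketch below uses a synthetic sample in place of real critical 𝜀 values (which would come from actual verification runs); the parameters of the synthetic distribution are illustrative, not taken from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical critical-epsilon sample; in the paper these come from
# running verifiers on 100 MNIST instances per network.
critical_eps = rng.lognormal(mean=-3.0, sigma=0.5, size=100)

# Fit a log-normal (location fixed at 0) and run a one-sample KS test
# against the fitted distribution.
shape, loc, scale = stats.lognorm.fit(critical_eps, floc=0)
statistic, p_value = stats.kstest(critical_eps, "lognorm",
                                  args=(shape, loc, scale))

# A large p-value means the log-normal hypothesis cannot be rejected.
print(f"KS statistic = {statistic:.4f}, p-value = {p_value:.4f}")
```

Note that testing against parameters fitted on the same sample makes the test conservative (the Lilliefors caveat); it serves here only to illustrate the procedure, not the paper's exact statistical protocol.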
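The experimental setup quotes a grid of 200 𝜀 values (0.001, 0.003, ..., 0.399) searched with 𝑘-binary search. A minimal sketch of the grid plus a plain binary search for the critical 𝜀 is shown below; `is_robust` is a hypothetical stand-in for a verifier query, and the single-query binary search simplifies the paper's 𝑘-binary search, which the paper describes only in prose.

```python
import numpy as np

# The epsilon grid from the quoted setup: 200 values (0.001, 0.003, ..., 0.399).
eps_grid = np.arange(0.001, 0.4, 0.002)

def critical_epsilon(is_robust, grid):
    """Return the largest epsilon on the grid at which the network is
    still verified robust, assuming robustness is monotone in epsilon.
    `is_robust` models a (hypothetical) verifier call."""
    lo, hi = 0, len(grid) - 1
    if not is_robust(grid[lo]):
        return None  # not robust even at the smallest perturbation
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if is_robust(grid[mid]):
            lo = mid
        else:
            hi = mid - 1
    return grid[lo]

# Toy monotone "verifier": robust strictly below a hidden threshold.
print(critical_epsilon(lambda e: e < 0.105, eps_grid))
```

Each `is_robust` call corresponds to one verification query, which the quoted setup caps at a one-hour timeout; in practice a timeout would need a third "unknown" outcome that this two-valued sketch omits.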