Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Critically Assessing the State of the Art in Neural Network Verification
Authors: Matthias König, Annelot W. Bosman, Holger H. Hoos, Jan N. van Rijn
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To answer this question, we performed a systematic performance analysis of several CPU- and GPU-based local robustness verification systems on a newly and carefully assembled set of 79 neural networks, of which we verified a broad range of robustness properties, while taking a practitioner's point of view: a perspective that complements the insights from initiatives such as the VNN competition, where the participating tools are carefully adapted to the given benchmarks by their developers. Notably, we show that no single best algorithm dominates performance across all verification problem instances. Instead, our results reveal complementarities in verifier performance and illustrate the potential of leveraging algorithm portfolios for more efficient local robustness verification. We quantify this complementarity using various performance measures, such as the Shapley value. |
| Researcher Affiliation | Academia | ¹Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands; ²Chair for AI Methodology, RWTH Aachen University, Germany; ³University of British Columbia, Canada |
| Pseudocode | No | The paper describes various algorithmic approaches (e.g., SMT solvers, mixed integer programming, branch-and-bound) and concepts (e.g., symbolic interval propagation, zonotope abstraction) but does not present any of these in a structured pseudocode or algorithm block format. |
| Open Source Code | No | The paper provides a public repository for experimental data and the network collection, and references specific commits for third-party verification frameworks used (β-CROWN, OVAL BaB, MN-BaB), but it does not provide the source code for the authors' own analysis methodology or benchmarking setup. |
| Open Datasets | Yes | Overall, our benchmark is comprised of 79 image classification networks, of which 38 are trained on the CIFAR-10 dataset and 41 are trained on the MNIST dataset. ... lastly, we provide a public repository containing all experimental data, along with the newly assembled network collection.1 1. https://github.com/ADA-research/nn-verification-assessment |
| Dataset Splits | No | Of each network, we verified 100 local robustness properties; more precisely, we sampled 100 images from the dataset on which the network has been trained and verified for local robustness with the perturbation radius ϵ set at {0.004, 0.005, 0.008, 0.01, 0.012, 0.02, 0.025, 0.03, 0.04}. The paper mentions sampling 100 images but does not specify how these images were split from the original datasets (e.g., training, test, validation splits) for reproducibility, nor does it provide a specific random seed for the sampling process. |
| Hardware Specification | Yes | Our experiments were carried out on a cluster of machines equipped with Intel Xeon E5-2683 CPUs with 32 cores, 40 MB cache size and 94 GB RAM, running CentOS Linux 7. Each verification method was limited to using a single CPU core per run. ... For GPU-accelerated methods, we used machines equipped with NVIDIA GeForce GTX 1080 Ti GPUs with 11 GB video memory. |
| Software Dependencies | Yes | Generally, we executed the verification algorithms through the DNNV interface, version 0.4.8. ... The GPU-based methods we considered are not supported by DNNV. Hence, we used the standalone implementations of these algorithms through the β-CROWN², OVAL BaB³, and MN-BaB⁴ frameworks, respectively. 2. Commit 7a46097192207dfbb2fa7135857d6cd5 3. Commit 9e1606044759da5693f226ce489e9d4dded21bd6 4. Commit 2aa12b145bb61342f4c464b64be3467b3a275e46 |
| Experiment Setup | Yes | All methods were employed with their default hyperparameter settings, as they would likely be used by practitioners. ... Each query (i.e., attempt to solve a verification problem instance) was given a time budget of 3 600 seconds and a memory budget of 3 GB. ... we sampled 100 images from the dataset on which the network has been trained and verified for local robustness with the perturbation radius ϵ set at {0.004, 0.005, 0.008, 0.01, 0.012, 0.02, 0.025, 0.03, 0.04}. |
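The paper quantifies verifier complementarity using measures such as the Shapley value: each verifier's average marginal contribution to the performance of a portfolio, over all orderings in which the portfolio could be assembled. The sketch below illustrates the idea with an exact Shapley computation where a coalition's value is the number of instances solved by at least one member within the time budget. The verifier names echo the frameworks used in the paper, but the solve times and the specific coalition value function are purely illustrative assumptions, not the paper's actual data or methodology.

```python
from itertools import combinations
from math import factorial

# Hypothetical solve times (seconds) on 5 problem instances; None = timeout.
# Numbers are illustrative only, not results from the paper.
solve_times = {
    "beta-CROWN": [12, None, 40, None, 5],
    "OVAL-BaB":   [None, 30, 55, None, 7],
    "MN-BaB":     [20, None, None, 90, 6],
}
BUDGET = 3600  # per-query time budget, as in the paper's setup

def coalition_value(members):
    """Instances solved by at least one verifier in the coalition within budget."""
    if not members:
        return 0
    n_instances = len(next(iter(solve_times.values())))
    return sum(
        any(solve_times[m][i] is not None and solve_times[m][i] <= BUDGET
            for m in members)
        for i in range(n_instances)
    )

def shapley_values(players):
    """Exact Shapley value of each player's marginal contribution."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for coalition in combinations(others, k):
                # Weight = fraction of orderings in which `coalition` precedes p.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (
                    coalition_value(coalition + (p,)) - coalition_value(coalition)
                )
    return phi

print(shapley_values(list(solve_times)))
```

By the efficiency property, the values sum to the portfolio's total performance (here, 5 solved instances); a verifier that mostly solves instances others also solve receives a smaller share, which is exactly the complementarity signal the paper exploits.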