Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Critically Assessing the State of the Art in Neural Network Verification
Authors: Matthias König, Annelot W. Bosman, Holger H. Hoos, Jan N. van Rijn
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To answer this question, we performed a systematic performance analysis of several CPU- and GPU-based local robustness verification systems on a newly and carefully assembled set of 79 neural networks, of which we verified a broad range of robustness properties, while taking a practitioner's point of view: a perspective that complements the insights from initiatives such as the VNN competition, where the participating tools are carefully adapted to the given benchmarks by their developers. Notably, we show that no single best algorithm dominates performance across all verification problem instances. Instead, our results reveal complementarities in verifier performance and illustrate the potential of leveraging algorithm portfolios for more efficient local robustness verification. We quantify this complementarity using various performance measures, such as the Shapley value. |
| Researcher Affiliation | Academia | ¹Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands; ²Chair for AI Methodology, RWTH Aachen University, Germany; ³University of British Columbia, Canada |
| Pseudocode | No | The paper describes various algorithmic approaches (e.g., SMT solvers, mixed integer programming, branch-and-bound) and concepts (e.g., symbolic interval propagation, zonotope abstraction) but does not present any of these in a structured pseudocode or algorithm block format. |
| Open Source Code | No | The paper provides a public repository for experimental data and the network collection, and references specific commits for third-party verification frameworks used (β-CROWN, OVAL BaB, MN-BaB), but it does not provide the source code for the authors' own analysis methodology or benchmarking setup. |
| Open Datasets | Yes | Overall, our benchmark is comprised of 79 image classification networks, of which 38 are trained on the CIFAR-10 dataset and 41 are trained on the MNIST dataset. ... lastly, we provide a public repository containing all experimental data, along with the newly assembled network collection.1 1. https://github.com/ADA-research/nn-verification-assessment |
| Dataset Splits | No | Of each network, we verified 100 local robustness properties; more precisely, we sampled 100 images from the dataset on which the network has been trained and verified for local robustness with the perturbation radius ϵ set at {0.004, 0.005, 0.008, 0.01, 0.012, 0.02, 0.025, 0.03, 0.04}. The paper mentions sampling 100 images but does not specify how these images were split from the original datasets (e.g., training, test, validation splits) for reproducibility, nor does it provide a specific random seed for the sampling process. |
| Hardware Specification | Yes | Our experiments were carried out on a cluster of machines equipped with Intel Xeon E5-2683 CPUs with 32 cores, 40 MB cache size and 94 GB RAM, running CentOS Linux 7. Each verification method was limited to using a single CPU core per run. ... For GPU-accelerated methods, we used machines equipped with NVIDIA GeForce GTX 1080 Ti GPUs with 11 GB video memory. |
| Software Dependencies | Yes | Generally, we executed the verification algorithms through the DNNV interface, version 0.4.8. ... The GPU-based methods we considered are not supported by DNNV. Hence, we used the standalone implementations of these algorithms through the β-CROWN², OVAL BaB³, and MN-BaB⁴ frameworks, respectively. 2. Commit 7a46097192207dfbb2fa7135857d6cd5 3. Commit 9e1606044759da5693f226ce489e9d4dded21bd6 4. Commit 2aa12b145bb61342f4c464b64be3467b3a275e46 |
| Experiment Setup | Yes | All methods were employed with their default hyperparameter settings, as they would likely be used by practitioners. ... Each query (i.e., attempt to solve a verification problem instance) was given a time budget of 3 600 seconds and a memory budget of 3 GB. ... we sampled 100 images from the dataset on which the network has been trained and verified for local robustness with the perturbation radius ϵ set at {0.004, 0.005, 0.008, 0.01, 0.012, 0.02, 0.025, 0.03, 0.04}. |
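The paper quantifies verifier complementarity using measures such as the Shapley value: each verifier's average marginal contribution to the performance of a portfolio, over all orderings in which the portfolio could be assembled. The sketch below illustrates the idea with an exact Shapley computation where a coalition's value is the number of instances solved by at least one member within the time budget. The verifier names echo the frameworks used in the paper, but the solve times and the specific coalition value function are purely illustrative assumptions, not the paper's actual data or methodology.

```python
from itertools import combinations
from math import factorial

# Hypothetical solve times (seconds) on 5 problem instances; None = timeout.
# Numbers are illustrative only, not results from the paper.
solve_times = {
    "beta-CROWN": [12, None, 40, None, 5],
    "OVAL-BaB":   [None, 30, 55, None, 7],
    "MN-BaB":     [20, None, None, 90, 6],
}
BUDGET = 3600  # per-query time budget, as in the paper's setup

def coalition_value(members):
    """Instances solved by at least one verifier in the coalition within budget."""
    if not members:
        return 0
    n_instances = len(next(iter(solve_times.values())))
    return sum(
        any(solve_times[m][i] is not None and solve_times[m][i] <= BUDGET
            for m in members)
        for i in range(n_instances)
    )

def shapley_values(players):
    """Exact Shapley value of each player's marginal contribution."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for coalition in combinations(others, k):
                # Weight = fraction of orderings in which `coalition` precedes p.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (
                    coalition_value(coalition + (p,)) - coalition_value(coalition)
                )
    return phi

print(shapley_values(list(solve_times)))
```

By the efficiency property, the values sum to the portfolio's total performance (here, 5 solved instances); a verifier that mostly solves instances others also solve receives a smaller share, which is exactly the complementarity signal the paper exploits.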