Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Generating and Checking DNN Verification Proofs
Authors: Hai Duong, ThanhVu Nguyen, Matthew Dwyer
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our work on a benchmark of 400 verification problems involving 16 networks, including large models (up to 1.7M parameters) ( 5), and demonstrate that APTP and APTPchecker are robust to variation in proof structure arising from different DNN verification algorithms. |
| Researcher Affiliation | Academia | Hai Duong George Mason University Fairfax, VA 22030 EMAIL Thanh Vu Nguyen George Mason University Fairfax, VA 22030 EMAIL Matthew B. Dwyer University of Virginia Charlottesville, VA 22904 EMAIL |
| Pseudocode | Yes | Alg. 1. The Ba BNV algorithm with proof generation. input :DNN N, property Οin Οout output :(unsat, proof) if property is valid, otherwise (sat, cex) |
| Open Source Code | Yes | APTPchecker is available at: https://github.com/dynaroars/APTPchecker. |
| Open Datasets | Yes | We evaluate on UNSAT verification problems selected from the benchmark suite introduced in [9], which includes ACAS Xu, RESNET_A/B, CIFAR2020, MNISTFC, and MNIST_GDVB. |
| Dataset Splits | Yes | For each network, we randomly sampled local robustness properties until we obtained 25 UNSAT instances, yielding 200 CNN and 200 FNN problems (400 total) as shown in Tab. 1. |
| Hardware Specification | Yes | All experiments were run on a Linux machine with an AMD Threadripper 64-core 4.2GHZ CPU, 128GB RAM, and an NVIDIA Ge Force RTX 4090 GPU with 24 GB VRAM. |
| Software Dependencies | No | APTPchecker is written in Python, and uses Gurobi [35] for LP solving. The core APTPchecker algorithm (Alg. 2) consists of 600 SLOC, while optimizations use an additional 200 SLOC. |
| Experiment Setup | Yes | Metrics To assess performance we use the two common metrics in the verification community [4]: (i) time to solve and (ii) number of problems solved. We record time to verify, generate, and check proofs, using a 1000-second timeout. |