Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Generating and Checking DNN Verification Proofs

Authors: Hai Duong, ThanhVu Nguyen, Matthew Dwyer

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our work on a benchmark of 400 verification problems involving 16 networks, including large models (up to 1.7M parameters) ( 5), and demonstrate that APTP and APTPchecker are robust to variation in proof structure arising from different DNN verification algorithms.
Researcher Affiliation	Academia	Hai Duong George Mason University Fairfax, VA 22030 EMAIL Thanh Vu Nguyen George Mason University Fairfax, VA 22030 EMAIL Matthew B. Dwyer University of Virginia Charlottesville, VA 22904 EMAIL
Pseudocode	Yes	Alg. 1. The Ba BNV algorithm with proof generation. input :DNN N, property ϕin ϕout output :(unsat, proof) if property is valid, otherwise (sat, cex)
Open Source Code	Yes	APTPchecker is available at: https://github.com/dynaroars/APTPchecker.
Open Datasets	Yes	We evaluate on UNSAT verification problems selected from the benchmark suite introduced in [9], which includes ACAS Xu, RESNET_A/B, CIFAR2020, MNISTFC, and MNIST_GDVB.
Dataset Splits	Yes	For each network, we randomly sampled local robustness properties until we obtained 25 UNSAT instances, yielding 200 CNN and 200 FNN problems (400 total) as shown in Tab. 1.
Hardware Specification	Yes	All experiments were run on a Linux machine with an AMD Threadripper 64-core 4.2GHZ CPU, 128GB RAM, and an NVIDIA Ge Force RTX 4090 GPU with 24 GB VRAM.
Software Dependencies	No	APTPchecker is written in Python, and uses Gurobi [35] for LP solving. The core APTPchecker algorithm (Alg. 2) consists of 600 SLOC, while optimizations use an additional 200 SLOC.
Experiment Setup	Yes	Metrics To assess performance we use the two common metrics in the verification community [4]: (i) time to solve and (ii) number of problems solved. We record time to verify, generate, and check proofs, using a 1000-second timeout.