Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Let a Neural Network be Your Invariant

Authors: Mirco Giacobbe, Daniel Kroening, Abhinandan Pal, Michael Tautschnig

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We experimentally demonstrate that our method is orders of magnitude faster than the state-of-the-art model checkers on pure liveness and combined safety and liveness verification tasks written in System Verilog... We implemented our method using a new neural ranking function architecture... We extended the prior benchmark with standard safety and combined liveness and safety verification problems, and have compared our method with the state-of-the-art hardware model checkers. ... Our contribution is threefold. ... Third, by evaluating our method across a 634-task benchmark suite, we demonstrate that our method outperforms the state of the art in automated formal verification.
Researcher Affiliation	Collaboration	Mirco Giacobbe University of Birmingham, UK Zeroth Research, UK Daniel Kroening Amazon Web Services, USA Abhinandan Pal University of Birmingham, UK National Institute of Informatics Tokyo, Japan Michael Tautschnig Amazon Web Services, USA Queen Mary University of London, UK
Pseudocode	No	The paper describes the learn-check workflow in Section 3 and the MILP encoding in Section 4 using descriptive text and mathematical equations, but does not present structured pseudocode or algorithm blocks.
Open Source Code	Yes	For reproducibility and further implementation details, our code and benchmarks will be made publicly available. ... Alongside the paper, we provide a zip archive as per NeurIPS guidelines. It includes all benchmarks, experiment scripts, and a README.md with detailed instructions for reproducing our results.
Open Datasets	Yes	Building on the ten parameterised System Verilog designs and 194 pureliveness tasks of prior work [63], we add an eleventh design and, for all eleven, contribute additional pure-safety and combined safety liveness specifications on each task, yielding a 634-task suite that retains the original diversity (Appendix A). ... Alongside the paper, we provide a zip archive as per NeurIPS guidelines. It includes all benchmarks, experiment scripts, and a README.md with detailed instructions for reproducing our results.
Dataset Splits	No	Our procedure incrementally constructs two datasets Dinit and Dtrans to train V to respectively satisfy conditions (6) and (7). The dataset Dinit contains sample initial register values reg s of M where s S0, and the dataset Dtrans contains sample transitions of M A ϕ in the form of quadruples (reg s, q, reg s , q ) where (s, q) M A ϕ (s , q ). Initially, we assume that Dinit and Dtrans are empty. Our learning component computes a set of parameters {θq}q Q to hold over the set of samples Dinit and Dtrans. Our checking component formally verifies whether conditions (6) and (7) hold over the entire state space. If the checker confirms the conditions, then the procedure halts successfully. In the converse case, it provides sets of counterexamples Cinit and Ctrans, which are respectively added to Dinit and Dtrans, and learning and checking are repeated in a loop. The paper does not specify fixed training/test/validation splits, but rather an iterative counterexample-guided synthesis approach to populate datasets.
Hardware Specification	Yes	All experiments ran on an AWS m6a.4xlarge instance (16 v CPUs, 64 GB RAM, Ubuntu 24.04).
Software Dependencies	Yes	We implemented a prototype using a modular Python 3.12 pipeline 2. Given an LTL formula Φ, we obtained A Φ with Spot 2.11.6 [51], converted System Verilog to bitvector SMT using EBMC 5.6 [91], performed validity queries and computed counterexamples using Bitwuzla 0.7.0 [96], and trained a neural partially-ranking function using the MILP solver Gurobi 12.0.1 [65].
Experiment Setup	Yes	Our method relies on a single global hyperparameter configuration, besides the integer bound P (see Section 4). We chose P from a finite set of configurations {1, 5, 10, M/10 , M/2 , M, M+1, 2M}, where M is the largest value any register can take in the hardware design; this renders the bounds for zi1j and vj quadratic in M except when P = 1, 5, or 10. We configured the architecture to start with a linear model (per q A ϕ) and add one hidden layer with a single neuron (left branch of Figure 2a) upon the first failure of the learning phase, subsequently increasing its width by one neuron upon subsequent failures, up to 5 attempts.