Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

RvLLM: LLM Runtime Verification with Domain Knowledge

Authors: Yedi Zhang, Sun Emma, Annabelle En, Jin Song Dong

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate Rv LLM on three representative tasks: violation detection against Singapore Rapid Transit Systems Act, numerical comparison, and inequality solving. Experimental results show that Rv LLM effectively detects erroneous outputs across various LLMs in a lightweight and flexible manner. The results reveal that despite their impressive capabilities, LLMs remain prone to low-level errors due to a lack of formal guarantees during inference, and our framework offers a potential long-term solution by leveraging expert domain knowledge to rigorously and efficiently verify LLM outputs.
Researcher Affiliation	Academia	Yedi Zhang National University of Singapore Singapore Sun Yi Emma National University of Singapore Singapore Annabelle Lee Jia En National University of Singapore Singapore Jin Song Dong National University of Singapore Singapore
Pseudocode	No	To achieve it, given a rule set ΓR and an initially defined literals Lit with truth values, we first construct an initial graph G through the following steps: Step 1 (Literal Nodes): Create a node for each literal that appears in the rule set ΓR. Step 2 (LHS Node): Create an LHS node for each rule-like propositional formula in ΓR. Step 3 (Edges): For each formula in ΓR, such as l1 ln ln+1, add a directed edge from each literal node li (i [n]) to the corresponding LHS node, and add a directed edge from this LHS node to the literal node ln+1. Next, according to the initially defined literals, we mark all the literal nodes valued True as Lit and mark all the literal nodes valued False as Lit , based on the truth value of literals from Lit. Then, we perform forward chaining procedure on the graph G and update Lit and Lit as follows until no more update can be done or an inconsistency is detected by Lit Lit = : While these steps describe a procedure, they are presented as a textual description rather than a formally structured pseudocode or algorithm block.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code and data will be open-sourced and made available on our website once they are properly organized. We also upload the preliminary version (of code) as a zip file as supplemental materials in the submission.
Open Datasets	Yes	We conduct experiments using contextual scenarios derived from [58], consisting of 304 cases in total 281 labeled as involving violations (unsafe) and 23 as not (safe). To incorporate relevant domain knowledge, we carefully encode 31 of the 53 regulations from Singapore Rapid Transit System Act into ESL specifications. It took a research scientist three days to interpret the law and develop corresponding specifications a one-time effort applicable for verifying any LLM application against these regulations. In this study, the same model serves both as the target LLM and as the perception agent. ... We compile a dataset of 40 inequality questions sourced from A-Level H2 Mathematics examination papers [32] and carefully design three specifications to serve as domain knowledge that Rv LLM uses for verification.
Dataset Splits	No	For this case study, we apply Rv LLM under the Level-1 interpretation to evaluate the LLM s responses in detecting violations within contextual scenarios against the Singapore Rapid Transit Systems Act. In this capacity, Rv LLM can be regarded as a complementary mechanism to enhance the LLM s capability in detecting the law violations in this study. We conduct experiments using contextual scenarios derived from [58], consisting of 304 cases in total 281 labeled as involving violations (unsafe) and 23 as not (safe). ... Without loss of generality, we randomly generate 100 questions following the guideline for increasing the likelihood of incorrect comparison results by LLMs. The specification file used here is the one shown in Figure 2. Again, we use the same model to serve both the target LLM and the perception agent. ... We compile a dataset of 40 inequality questions sourced from A-Level H2 Mathematics examination papers [32] and carefully design three specifications to serve as domain knowledge that Rv LLM uses for verification. The paper describes the number of cases or questions used for evaluation (e.g., 304 cases, 100 questions, 40 questions) and their labels or generation methods, but does not specify explicit training/test/validation splits for model training or evaluation in the traditional sense.
Hardware Specification	Yes	All experiments are conducted on a machine with an Intel (R) Xeon(R) w7-2475X processor.
Software Dependencies	No	The paper lists various LLMs used for benchmarking (e.g., Qwen 2.5, GPT 4.1, Gemini 2.0, Deep Seek-V3) but does not provide specific version numbers for software libraries, frameworks, or programming languages used to implement the Rv LLM framework itself.
Experiment Setup	Yes	A dedicated guideline for designing interpretation abstraction prompts for the perception agent is provided in the appendices, along with additional experimental details and descriptions of the datasets and specifications used throughout the experiments. ... For this case study, we apply Rv LLM under the Level-1 interpretation to evaluate the LLM s responses in detecting violations within contextual scenarios against the Singapore Rapid Transit Systems Act. ... For this case study, we apply Rv LLM under Level-2 interpretation to verify the numerical comparison results produced by various LLMs.