Lemur: Integrating Large Language Models in Automated Program Verification
Authors: Haoze Wu, Clark Barrett, Nina Narodytska
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS We have presented the LEMUR calculus and described a sound and terminating algorithm based on LEMUR. In this section, we investigate the following questions: Can we develop a practical automated verification procedure based on Alg 1? [Yes] Is LEMUR competitive with existing end-to-end learning-based verification approaches? [Yes] Can LEMUR already prove hard benchmarks that are beyond the reach of state-of-the-art conventional program verifiers? [In several cases] and Table 1: Solved instances by ESBMC, LEMUR, and Code2Inv (1a) or UAUTOMIZER (1b) on two benchmark sets. We also report the average time and number of proposals for solved instances. |
| Researcher Affiliation | Collaboration | Haoze Wu, Clark Barrett (Department of Computer Science, Stanford University; {haozewu, barrett}@stanford.edu); Nina Narodytska (VMware Research, VMware by Broadcom; n.narodytska@gmail.com) |
| Pseudocode | Yes | Algorithm 1 The LEMUR procedure |
| Open Source Code | Yes | The source code and the benchmarks are publicly available at https://github.com/wu-haoze/Lemur-program-verification. |
| Open Datasets | Yes | The source code and the benchmarks are publicly available at https://github.com/wu-haoze/Lemur-program-verification. and We focus on benchmarks with less than 150 tokens (after removing comments and unnecessary headers, and after applying clang-formatting). We select 47 benchmarks that ESBMC and UAUTOMIZER are unable to solve within 10 minutes. The property is expected to hold in all benchmarks. We use a 15-minute per-instance timeout. |
| Dataset Splits | No | The paper uses established benchmark sets (Code2Inv, SV-COMP) but does not describe any specific training, validation, or test dataset splits created or used by the authors for their own experiments. The benchmarks are used as test instances. |
| Hardware Specification | No | The paper mentions using OpenAI's GPT models (GPT-3.5 turbo, GPT-4) and off-the-shelf verifiers (ESBMC, UAUTOMIZER) with time limits. However, it does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cluster specifications) on which the authors ran their experiments. |
| Software Dependencies | No | The paper mentions using specific external tools like ESBMC (Gadelha et al., 2018) and UAUTOMIZER (Heizmann et al., 2013), and OpenAI's GPT models (GPT-3.5 turbo, GPT-4). While these are specific tools/models, the paper does not provide version numbers for the verifiers or other ancillary software components used in the experimental setup, which are needed for reproducibility. |
| Experiment Setup | Yes | By default, we impose a 30-second time limit for each invocation of the verifier. That is, if the verifier does not terminate within 30 seconds, then the verifier returns UNKNOWN. and We use a 15-minute per-instance timeout. and A heuristic we found useful is to prompt GPT multiple times and order the properties by the frequency with which they are proposed (breaking ties by preferring shorter expressions). |
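
The experiment-setup excerpt describes two concrete mechanisms: a 30-second time limit per verifier invocation (returning UNKNOWN on timeout) and a heuristic that prompts GPT multiple times, then orders the proposed properties by proposal frequency, breaking ties in favor of shorter expressions. The sketch below is a hypothetical illustration of these two mechanisms, not the authors' implementation; the function names, the verifier command, and the example proposals are assumptions made for illustration.

```python
# Hypothetical sketch of the quoted experiment setup: a 30-second per-invocation
# verifier time limit and frequency-based ranking of GPT-proposed properties.
import subprocess
from collections import Counter

VERIFIER_TIME_LIMIT_S = 30  # per-invocation limit stated in the paper


def run_verifier(cmd: list[str]) -> str:
    """Invoke an off-the-shelf verifier; return UNKNOWN if it exceeds the limit."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=VERIFIER_TIME_LIMIT_S
        )
    except subprocess.TimeoutExpired:
        return "UNKNOWN"
    return result.stdout.strip()


def rank_proposals(proposals: list[str]) -> list[str]:
    """Order candidate properties by how often they were proposed across
    repeated GPT prompts, breaking ties by preferring shorter expressions."""
    counts = Counter(proposals)
    return sorted(counts, key=lambda p: (-counts[p], len(p)))


if __name__ == "__main__":
    # Assumed example proposals collected from several GPT prompts.
    sampled = ["x >= 0", "x >= 0", "x >= 0 && y <= n", "y <= n", "y <= n"]
    print(rank_proposals(sampled))  # ['x >= 0', 'y <= n', 'x >= 0 && y <= n']
```

In this reading, the per-invocation timeout bounds the cost of each verifier call during the Lemur loop, while the frequency ranking decides which proposed property the procedure tries first.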