Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
DafnyBench: A Benchmark for Formal Software Verification
Authors: Chloe R Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, Max Tegmark
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough annotations for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved a 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and annotations. (...) We tested GPT-4o, GPT-4 Turbo OpenAI et al. (2024), GPT-3.5 Turbo Brown et al. (2020), Claude 3 Opus Anthropic (2024), and Code Llama-7b-Instruct-hf hug (2022) on the 782-program benchmark. Table 2 shows that Claude 3 Opus performed best, achieving a success rate of 68%. |
| Researcher Affiliation | Collaboration | Chloe Loughridge (Harvard College); Qinyi Sun (Massachusetts Institute of Technology); Seth Ahrenbach; Federico Cassano (Northeastern University); Chuyue Sun (Stanford University); Ying Sheng (Stanford University); Anish Mudide (Massachusetts Institute of Technology); Md Rakib Hossain Misu (University of California, Irvine); Nada Amin (Harvard University); Max Tegmark (Massachusetts Institute of Technology) |
| Pseudocode | Yes | Figure 7: Pseudocode for the minhash deduplication algorithm (continued). |
| Open Source Code | Yes | We have assembled the largest machine learning benchmark to date for formal software verification and made it publicly available on GitHub at https://anonymous.4open.science/r/DafnyBench-839D. |
| Open Datasets | Yes | We have assembled the largest machine learning benchmark to date for formal software verification and made it publicly available on GitHub at https://anonymous.4open.science/r/DafnyBench-839D. (...) Our benchmark contains the 782 ground_truth programs and the corresponding verification_conditions_removed programs. |
| Dataset Splits | No | We do not yet provide an official training dataset or models custom-trained to do well on the DafnyBench evaluation set. However, we do provide the full JSON file produced by the GitHub scrape, and we separately provide the names of the files we use for the evaluation benchmark. Hence, it is possible for researchers to use files from the GitHub scrape that are not used in the benchmark as training data, though we cannot at this time provide strong guarantees on similarity between such training problems and the benchmark problems. |
| Hardware Specification | Yes | All evaluations were completed on a Linux cluster with an A100 Nvidia GPU. |
| Software Dependencies | Yes | We then attempted to verify each of these remaining files using the dafny verify command with a local installation of Dafny 4.3.0 (...) We used the sglang package Zheng et al. (2023a) to efficiently query the models. |
| Experiment Setup | Yes | For hyperparameters, we set max_tokens = 4096, which corresponds to the lowest max output token limit among all the evaluated models, and we set temperature = 0.3. We gave each model up to n = 10 attempts at a given file. If it succeeded on an attempt before the nth, it would be early stopped. If the model failed on any of the intermediate attempts, it received the Dafny error message and was asked to fill in the annotations again with the error message taken into consideration. |
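The retry-with-feedback protocol quoted above (up to n = 10 attempts, early stop on success, Dafny error message fed back on failure) can be sketched as a small Python loop. This is a minimal illustration, not the authors' code: `fill_annotations` stands in for the LLM call, `verify` for a `dafny verify` invocation, and both names are assumptions introduced here.

```python
import subprocess
import tempfile
import os

def attempt_with_feedback(fill_annotations, verify, program, max_attempts=10):
    """Sketch of the benchmark's retry loop: ask the model to fill in
    annotations, verify the result, and on failure pass the error
    message back to the model. Stops early on the first success."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = fill_annotations(program, feedback)
        ok, error_message = verify(candidate)
        if ok:
            return candidate, attempt  # early stop on success
        feedback = error_message       # next attempt sees the error
    return None, max_attempts          # all attempts exhausted

def dafny_verify(source):
    """Hypothetical wrapper around `dafny verify` (assumes a local
    Dafny installation, e.g. 4.3.0, on PATH)."""
    with tempfile.NamedTemporaryFile("w", suffix=".dfy", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(["dafny", "verify", path],
                              capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout + proc.stderr
    finally:
        os.unlink(path)
```

A usage example with stubbed model and verifier (so it runs without Dafny or an API key): a fake model that only succeeds once it has seen feedback exercises the early-stopping path on the second attempt.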