Detecting Bugs with Substantial Monetary Consequences by LLM and Rule-based Reasoning
Authors: Brian Zhang, Zhuo Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper proposes a hybrid system combining LLMs and rule-based reasoning to detect accounting error vulnerabilities in smart contracts. In particular, it utilizes the understanding capabilities of LLMs to annotate the financial meaning of variables in smart contracts, and employs rule-based reasoning to propagate the information throughout a contract's logic and to validate potential vulnerabilities. We achieve 75.6% accuracy on the labelling of financial meanings against human annotations. Furthermore, we achieve a recall of 90.5% from running on 23 real-world smart contract projects containing 21 accounting error vulnerabilities. Finally, we apply the automated technique on 8 recent projects, finding 4 known and 2 unknown bugs. |
| Researcher Affiliation | Academia | Brian Zhang University of Texas at Austin Austin, TX 78705 bz5346@utexas.edu Zhuo Zhang Purdue University West Lafayette, IN 47906 zhan3299@purdue.edu |
| Pseudocode | Yes | Algorithm 1: Rule-based reasoning pseudocode; Algorithm 2: Trace generation pseudocode |
| Open Source Code | Yes | We have submitted our code, as well as instructions on how to set up our technique. Due to space limitations (the entire benchmark requires > 20 GB), we only submit a subset of our benchmark. |
| Open Datasets | Yes | We utilize the dataset provided by Zhang et al. [2023], which contains 513 smart contract bugs across 113 smart contract projects. |
| Dataset Splits | No | The paper describes using existing datasets and new projects for evaluation, but does not provide specific training/validation/test splits (e.g., percentages or exact counts) for its experiments. |
| Hardware Specification | Yes | The experiments are conducted on a machine with AMD Ryzen 3975x and 512GB RAM. |
| Software Dependencies | No | The paper mentions Slither, GPT-3.5 turbo, and Solidity, but does not give version numbers for these dependencies (e.g., the Slither version or the exact Python libraries and their versions), which reproducibility would require. |
| Experiment Setup | Yes | GPT is prompted with high-level definitions of each financial meaning, as well as few-shot examples of real-world instances. We chose GPT-3.5 turbo for our LLM. We used 50 fine-tuning examples covering all the supported financial types and those without financial meanings. |
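
The two-stage idea summarized in the Research Type row (label variables with a financial meaning, then propagate those labels by rules and flag operations that no rule permits) can be illustrated with a small sketch. This is a toy example under stated assumptions, not the paper's Algorithm 1: the label names (`balance`, `price`, `raw`), the rule table, and the four-field statement format are all hypothetical, and the seed labels are hard-coded where the actual system would obtain them from the LLM.

```python
from typing import Dict, List, Tuple

# Hypothetical rule table: (left label, operator, right label) -> result label.
# These rules are illustrative assumptions, not the rules used in the paper.
RULES: Dict[Tuple[str, str, str], str] = {
    ("balance", "+", "balance"): "balance",
    ("balance", "-", "balance"): "balance",
    ("balance", "*", "price"): "balance",
    ("balance", "/", "price"): "balance",
    ("balance", "*", "raw"): "balance",
    ("price", "*", "raw"): "price",
}

# A statement is (target, left operand, operator, right operand).
Statement = Tuple[str, str, str, str]


def propagate(seed: Dict[str, str], program: List[Statement]) -> List[str]:
    """Propagate labels through straight-line code and collect warnings."""
    labels = dict(seed)  # copy so the caller's seed labels are not mutated
    warnings: List[str] = []
    for target, left, op, right in program:
        l, r = labels.get(left), labels.get(right)
        if l is None or r is None:
            continue  # an unlabeled operand gives nothing to infer
        result = RULES.get((l, op, r))
        if result is None:
            # No rule allows this combination: report it as suspicious.
            warnings.append(
                f"{target} = {left} {op} {right}: no rule for {l} {op} {r}"
            )
        else:
            labels[target] = result
    return warnings


# Seed labels stand in for the LLM annotation step (hard-coded assumption).
seed = {"deposit": "balance", "rate": "price", "fee": "balance"}
program = [
    ("value", "deposit", "*", "rate"),  # balance * price -> balance
    ("net", "value", "-", "fee"),       # balance - balance -> balance
    ("bad", "net", "+", "rate"),        # balance + price -> no rule
]
warnings = propagate(seed, program)
```

Here only the third statement is flagged, because adding a price to a balance matches no entry in the hypothetical rule table. A real analysis would also need to handle branches, function calls, and conflicting labels, which this sketch leaves out.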