Detecting Bugs with Substantial Monetary Consequences by LLM and Rule-based Reasoning

Authors: Brian Zhang, Zhuo Zhang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper proposes a hybrid system combining LLMs and rule-based reasoning to detect accounting error vulnerabilities in smart contracts. In particular, it utilizes the understanding capabilities of LLMs to annotate the financial meaning of variables in smart contracts, and employs rule-based reasoning to propagate the information throughout a contract's logic and to validate potential vulnerabilities. We achieve 75.6% accuracy on the labelling of financial meanings against human annotations. Furthermore, we achieve a recall of 90.5% from running on 23 real-world smart contract projects containing 21 accounting error vulnerabilities. Finally, we apply the automated technique on 8 recent projects, finding 4 known and 2 unknown bugs.
Researcher Affiliation | Academia | Brian Zhang, University of Texas at Austin, Austin, TX 78705, bz5346@utexas.edu; Zhuo Zhang, Purdue University, West Lafayette, IN 47906, zhan3299@purdue.edu
Pseudocode | Yes | Algorithm 1: Rule-based reasoning pseudocode; Algorithm 2: Trace generation pseudocode
Open Source Code | Yes | We have submitted our code, as well as instructions on how to set up our technique. Due to space limitations (the entire benchmark requires > 20 GB), we only submit a subset of our benchmark.
Open Datasets | Yes | We utilize the dataset provided by Zhang et al. [2023], which contains 513 smart contract bugs across 113 smart contract projects.
Dataset Splits | No | The paper describes using existing datasets and new projects for evaluation, but does not provide specific training/validation/test splits (e.g., percentages or exact counts) for its experiments.
Hardware Specification | Yes | The experiments are conducted on a machine with an AMD Ryzen 3975x CPU and 512 GB of RAM.
Software Dependencies | No | The paper mentions 'Slither', 'GPT-3.5 turbo', and 'Solidity', but does not provide specific version numbers for the software dependencies needed for reproducibility (e.g., the Slither version, or the exact Python libraries with their versions).
Experiment Setup | Yes | GPT is prompted with high-level definitions of each financial meaning, as well as few-shot examples of real-world instances. We chose GPT-3.5 turbo for our LLM. We used 50 fine-tuning examples covering all the supported financial types and those without financial meanings.
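
The Research Type row above outlines a two-stage design: an LLM labels the financial meaning of contract variables, and rule-based reasoning then uses those labels to surface accounting errors. The sketch below is a minimal, hypothetical illustration of that division of labor; the label set, function names, and the toy accounting rule are assumptions for illustration, not the paper's implementation.

```python
# Minimal, hypothetical sketch (not the authors' implementation) of how
# LLM-provided financial-meaning labels can feed a rule-based accounting
# check. The label set and the `ask_llm` callback are illustrative.

from typing import Callable

LABELS = {"balance", "fee", "share", "price", "none"}

def annotate(variables: list[str], ask_llm: Callable[[str], str]) -> dict[str, str]:
    """Step 1: ask the LLM for each variable's financial meaning; default to 'none'."""
    out = {}
    for v in variables:
        guess = ask_llm(v).strip().lower()
        out[v] = guess if guess in LABELS else "none"
    return out

def flag_suspicious_ops(ops: list[tuple[str, str, str]], labels: dict[str, str]) -> list[str]:
    """Step 2 (simplified): adding a 'fee' directly into a 'share' total is
    reported as a candidate accounting error (a toy rule, not the paper's)."""
    return [f"{a} + {b}" for op, a, b in ops
            if op == "+" and {labels.get(a), labels.get(b)} == {"fee", "share"}]

# Toy usage with a hard-coded stand-in for the LLM:
labels = annotate(["userShares", "protocolFee"],
                  lambda v: "fee" if "fee" in v.lower() else "share")
print(flag_suspicious_ops([("+", "userShares", "protocolFee")], labels))  # ['userShares + protocolFee']
```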
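
The Pseudocode row above names Algorithm 1 (rule-based reasoning) and Algorithm 2 (trace generation), which are not reproduced on this page. As a rough illustration of what label propagation of this kind typically looks like, the following fixpoint-style sketch iterates over assignments until no label changes; the transfer rules are made up for illustration and are not the paper's rules.

```python
# Generic fixpoint-style label propagation over assignments (dst = lhs op rhs).
# The transfer rules below are illustrative assumptions, not the paper's Algorithm 1.

def propagate(assignments: list[tuple[str, str, str, str]],
              labels: dict[str, str]) -> dict[str, str]:
    """Iterate until no variable's financial-meaning label changes."""
    changed = True
    while changed:
        changed = False
        for dst, lhs, op, rhs in assignments:
            l, r = labels.get(lhs, "none"), labels.get(rhs, "none")
            if op == "*" and "price" in (l, r):
                new = "balance"            # amount * price -> balance (toy rule)
            elif l == r:
                new = l                    # operands with the same label keep it
            else:
                new = labels.get(dst, "none")
            if labels.get(dst) != new:
                labels[dst] = new
                changed = True
    return labels

# Example: `shares * sharePrice` is inferred to carry a balance meaning.
print(propagate([("payout", "shares", "*", "sharePrice")],
                {"shares": "share", "sharePrice": "price"}))
```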
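
The Experiment Setup row above states that GPT-3.5 turbo is prompted with high-level definitions of each financial meaning plus few-shot examples of real-world instances. Below is a minimal sketch of such a prompt, assuming the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY in the environment; the definitions, examples, and label set are placeholders, not the paper's actual prompt.

```python
# Illustrative few-shot annotation prompt; all prompt content is a placeholder.

from openai import OpenAI

SYSTEM = (
    "You label Solidity variables with their financial meaning. "
    "Definitions: balance = tokens held on behalf of a user; "
    "fee = protocol charge; share = claim on a pool; price = exchange rate. "
    "Answer with exactly one label, or 'none'."
)

FEW_SHOT = [
    ("uint256 public totalSupply;", "share"),
    ("uint256 swapFeeBps;", "fee"),
]

def label_variable(declaration: str, model: str = "gpt-3.5-turbo") -> str:
    """Build a system prompt plus few-shot examples, then ask for one label."""
    messages = [{"role": "system", "content": SYSTEM}]
    for snippet, label in FEW_SHOT:
        messages.append({"role": "user", "content": snippet})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": declaration})
    client = OpenAI()
    resp = client.chat.completions.create(model=model, messages=messages, temperature=0)
    return resp.choices[0].message.content.strip().lower()

# e.g. label_variable("mapping(address => uint256) public balances;")  # -> "balance"
```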