LLMDFA: Analyzing Dataflow in Code with Large Language Models
Authors: Chengpeng Wang, Wuqi Zhang, Zian Su, Xiangzhe Xu, Xiaoheng Xie, Xiangyu Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LLMDFA on synthetic programs to detect three representative types of bugs and on real-world Android applications for customized bug detection. On average, LLMDFA achieves 87.10% precision and 80.77% recall, surpassing existing techniques with F1 score improvements of up to 0.35. |
| Researcher Affiliation | Collaboration | 1 Purdue University, 2 Hong Kong University of Science and Technology, 3 Ant Group |
| Pseudocode | No | The paper does not contain structured pseudocode or clearly labeled algorithm blocks. It shows concrete Python script examples (Figure 6, Figure 12(b)), but these are not generalized pseudocode. |
| Open Source Code | Yes | We have open-sourced LLMDFA at https://github.com/chengpeng-wang/LLMDFA. |
| Open Datasets | Yes | Juliet Test Suite [17] is a benchmark widely used to evaluate static analyzers. We choose Taint Bench Suite [19], which consists of 39 real-world Android malware applications. |
| Dataset Splits | No | The paper describes the evaluation metrics (precision, recall, F1 score) and how performance is measured for different phases of LLMDFA on benchmark suites like Juliet Test Suite and Taint Bench Suite, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper mentions configuring LLMDFA with various LLMs (gpt-3.5-turbo-0125, gpt-4-turbo-preview, gemini-1.0-pro, and claude-3-opus) and the cost of invoking them, implying the use of external API services. However, it does not provide specific hardware details (such as GPU/CPU models or memory) for its own experimental environment; LLM inference is performed through external APIs rather than on local hardware. |
| Software Dependencies | No | The paper mentions using the parsing library tree-sitter [31] and the Python binding of the Z3 solver [15] but does not provide specific version numbers for these software dependencies. (A hedged usage sketch of both libraries follows the table.) |
| Experiment Setup | Yes | We configure LLMDFA with four LLMs across various architectures, namely gpt-3.5-turbo-0125, gpt-4-turbo-preview, gemini-1.0-pro, and claude-3-opus. To reduce the randomness, we set the temperature to 0 so that LLMDFA performs greedy decoding without any sampling strategy. We refine the script at most three times. (A hedged configuration sketch, including the refinement loop, follows the table.) |
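
The dependencies named in the Software Dependencies row are both standard Python packages. Below is a minimal, hedged usage sketch of each; the Java snippet, the `tree_sitter_java` grammar module, and the toy constraints are illustrative assumptions rather than code from the LLMDFA repository, and the tree-sitter calls follow recent py-tree-sitter releases (older releases use a different `Language`/`Parser` API, which is exactly why unpinned versions matter).

```python
# Hedged sketch: typical usage of the two dependencies named in the paper.
# The Java snippet, the tree_sitter_java grammar module, and the constraints
# are illustrative assumptions, not code from the LLMDFA repository.
import tree_sitter_java as tsjava
from tree_sitter import Language, Parser
from z3 import Int, Solver, sat

# --- tree-sitter: parse a hypothetical Java snippet into a syntax tree ---
JAVA = Language(tsjava.language())   # recent py-tree-sitter API; older versions differ
parser = Parser(JAVA)
source = b"class C { void f() { int x = source(); sink(x); } }"
tree = parser.parse(source)
print(tree.root_node.type)           # "program" for a Java compilation unit

# --- Z3 Python binding: check satisfiability of a toy path condition ---
x = Int("x")
solver = Solver()
solver.add(x > 0, x < 10)            # hypothetical branch constraints
if solver.check() == sat:
    print(solver.model())            # e.g. [x = 1]
```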
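The Experiment Setup row quotes a temperature of 0 (greedy decoding) and at most three script refinements. The sketch below shows one plausible shape of such a configuration using the OpenAI Python client; the prompt wording, the `synthesize_script` and `run_validation` helpers, and the refinement loop are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch of the reported configuration: temperature 0 (greedy decoding)
# and at most three refinement rounds. Helper functions and prompts are
# hypothetical placeholders, not the authors' implementation.
from openai import OpenAI

client = OpenAI()        # reads OPENAI_API_KEY from the environment
MAX_REFINEMENTS = 3      # "We refine the script at most three times."

def ask_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",   # one of the four models evaluated
        temperature=0,                # greedy decoding, no sampling
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def run_validation(script: str):
    """Hypothetical test harness: return None on success or an error message."""
    try:
        compile(script, "<llm-script>", "exec")   # minimal syntax check as a stand-in
        return None
    except SyntaxError as exc:
        return str(exc)

def synthesize_script(task_prompt: str) -> str:
    """Generate a script, then refine it at most MAX_REFINEMENTS times."""
    script = ask_llm(task_prompt)
    for _ in range(MAX_REFINEMENTS):
        error = run_validation(script)
        if error is None:
            break
        script = ask_llm(
            f"{task_prompt}\nThe previous script failed with:\n{error}\nPlease fix it."
        )
    return script
```

Pinning the exact model snapshot (e.g. gpt-3.5-turbo-0125) and setting temperature to 0 are the two settings the paper reports for reducing randomness; everything else in this sketch is assumed scaffolding.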