LLMDFA: Analyzing Dataflow in Code with Large Language Models

Authors: Chengpeng Wang, Wuqi Zhang, Zian Su, Xiangzhe Xu, Xiaoheng Xie, Xiangyu Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate LLMDFA on synthetic programs to detect three representative types of bugs and on real-world Android applications for customized bug detection. On average, LLMDFA achieves 87.10% precision and 80.77% recall, surpassing existing techniques with F1 score improvements of up to 0.35. (The implied F1 score is sketched after this table.)
Researcher Affiliation | Collaboration | 1 Purdue University, 2 Hong Kong University of Science and Technology, 3 Ant Group
Pseudocode | No | The paper does not contain structured pseudocode or clearly labeled algorithm blocks. It shows concrete Python script examples (Figure 6, Figure 12(b)), but these are not generalized pseudocode.
Open Source Code | Yes | We have open-sourced LLMDFA at https://github.com/chengpeng-wang/LLMDFA.
Open Datasets | Yes | Juliet Test Suite [17] is a benchmark widely used to evaluate static analyzers. We choose Taint Bench Suite [19], which consists of 39 real-world Android malware applications.
Dataset Splits | No | The paper describes the evaluation metrics (precision, recall, F1 score) and how performance is measured for different phases of LLMDFA on benchmark suites such as the Juliet Test Suite and Taint Bench Suite, but it does not specify explicit training/validation/test splits with percentages or sample counts.
Hardware Specification | No | The paper mentions configuring LLMDFA with various LLMs (gpt-3.5-turbo-0125, gpt-4-turbo-preview, gemini-1.0-pro, and claude-3-opus) and the cost of invoking them, implying reliance on external API services. However, it does not provide specific hardware details (such as GPU/CPU models or memory) for its own experimental setup or for LLM inference beyond the API calls.
Software Dependencies | No | The paper mentions using the parsing library tree-sitter [31] and the Python binding of the Z3 solver [15], but does not provide version numbers for these dependencies. (An illustrative version-pinning sketch follows the table.)
Experiment Setup | Yes | We configure LLMDFA with four LLMs across various architectures, namely gpt-3.5-turbo-0125, gpt-4-turbo-preview, gemini-1.0-pro, and claude-3-opus. To reduce the randomness, we set the temperature to 0 so that LLMDFA performs greedy decoding without any sampling strategy. We refine the script at most three times. (A decoding and refinement sketch follows the table.)
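To make the headline numbers in the Research Type row concrete, here is a minimal Python sketch (ours, not from the paper) recovering the F1 score implied by the reported average precision and recall:

```python
# Minimal sketch: F1 implied by the paper's reported averages.
precision = 0.8710  # 87.10% average precision
recall = 0.8077     # 80.77% average recall

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # prints F1 = 0.8382
```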
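Because the Software Dependencies row notes that no versions are stated, a reproduction would need to pin tree-sitter and z3-solver explicitly. The sketch below records and verifies such pins; the version numbers are illustrative assumptions, not taken from the paper:

```python
# Sketch: verify pinned dependency versions for a reproduction run.
# The pins below are assumptions; the paper does not state versions.
import importlib.metadata as md

EXPECTED = {
    "tree-sitter": "0.21.3",    # assumed pin for the tree-sitter binding
    "z3-solver": "4.12.2.0",    # assumed pin for the Z3 Python binding
}

for pkg, want in EXPECTED.items():
    try:
        have = md.version(pkg)
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed (expected {want})")
        continue
    status = "OK" if have == want else f"MISMATCH (expected {want})"
    print(f"{pkg}=={have} {status}")
```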
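The Experiment Setup row describes two reproducibility-relevant settings: temperature 0 (greedy decoding) and at most three script refinements. The following is a minimal sketch of that configuration, not the authors' implementation; the prompt text and the run_script() validation hook are hypothetical placeholders:

```python
# Sketch (assumed, not the authors' code): temperature-0 decoding plus a
# bounded self-refinement loop, as described in the Experiment Setup row.
from openai import OpenAI

client = OpenAI()
MAX_REFINEMENTS = 3  # "We refine the script at most three times."

def ask(messages, model="gpt-3.5-turbo-0125"):
    # temperature=0 disables sampling, so decoding is effectively greedy.
    resp = client.chat.completions.create(
        model=model, temperature=0, messages=messages
    )
    return resp.choices[0].message.content

def synthesize_script(task_prompt, run_script):
    """Generate a script, refining up to MAX_REFINEMENTS times on failure.

    run_script(script) -> (ok, error) is a hypothetical hook that executes
    the candidate script on test inputs and reports any failure message.
    """
    messages = [{"role": "user", "content": task_prompt}]
    script = ask(messages)
    for _ in range(MAX_REFINEMENTS):
        ok, error = run_script(script)
        if ok:
            break
        # Feed the failure back to the model and request a fixed script.
        messages += [
            {"role": "assistant", "content": script},
            {"role": "user", "content": f"The script failed with:\n{error}\nPlease fix it."},
        ]
        script = ask(messages)
    return script
```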