Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Guiding LLM-based Smart Contract Generation with Finite State Machine
Authors: Hao Luo, Yuhao Lin, Xiao Yan, Xintong Hu, Yuxiang Wang, Qiming Zeng, Hao Wang, Jiawei Jiang
IJCAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results show that FSM-SCG significantly improves the quality of smart contract generation. Compared to the best baseline, FSM-SCG improves the compilation success rate of generated smart contract code by at most 48%, and reduces the average vulnerability risk score by approximately 68%. [...] Experimental results show that FSM-SCG significantly enhances the effectiveness and security of smart contracts. Using LLaMA3.1-8B, the compilation success rate reaches 95.1%, 48% higher than the best baseline. Security risks are greatly reduced, with the vulnerability risk score dropping by 68% on average. |
| Researcher Affiliation | Academia | Hao Luo1, Yuhao Lin1, Xiao Yan1, Xintong Hu2, Yuxiang Wang1, Qiming Zeng1, Hao Wang1, and Jiawei Jiang1; 1School of Computer Science, Wuhan University; 2School of Cyber Science and Engineering, Wuhan University. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | We also provide the detailed algorithm of FSM-SCG in Appendix A of [Luo et al., 2025]. |
| Open Source Code | Yes | This paper's code is available at https://github.com/pluto-ms/FSM-Smart-Contract-Generation. |
| Open Datasets | Yes | To tackle these two problems, we construct an open-source fine-tuning dataset that covers various tasks and adopts a dialogic format. We collect smart contract source code from platforms like Etherscan and use GPT-4o to generate the corresponding user requirements and FSM, forming a dataset of 30k items, each containing requirements (R), FSM (F), and code (C). |
| Dataset Splits | No | The paper mentions collecting a dataset of 30k items and sampling 1,000 high-quality requirements for testing, but it does not specify the explicit training/validation/test splits for the fine-tuning dataset in the main text. It states "We present the dataset details in Appendix B of [Luo et al., 2025]" implying the details are outside this paper. |
| Hardware Specification | Yes | We fine-tune instruction versions of LLaMA3.1-8B and Qwen2.5-7B models using 8 NVIDIA A6000-48GB GPUs. |
| Software Dependencies | No | The paper mentions using "py-solc-x to check for compilation errors I_c and Slither [Feist et al., 2019] to detect vulnerabilities I_s." However, it does not provide specific version numbers for these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | FPFT runs for 3 epochs using AdamW (lr=5e-5). FSM-SCG includes one compilation and one security feedback round. [...] Effect of Feedback Count. Varying the feedback count from 0 to 5, the largest improvement occurs from 0 to 1, as most errors are simple and need minimal iterations. Hence, feedback count is set to 1 for balance. Effect of Fine-Tuning Epochs. We fine-tune FSM-SCG from 1 to 5 epochs. Performance improves notably up to 3 epochs but plateaus thereafter, as the model converges early and saturates with further training. To balance efficiency and performance, we select 3 epochs. |
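The Experiment Setup row describes a generate-then-feedback loop with the feedback count fixed at 1. As a rough illustration of that control flow only, the sketch below stubs out the moving parts: `generate_code`, `check_compilation`, and `fsm_scg` are hypothetical stand-ins for the paper's fine-tuned LLM and its py-solc-x compilation check (the real pipeline also runs a Slither security round, omitted here), not the authors' actual implementation.

```python
# Minimal sketch of a single-round compilation-feedback loop, assuming
# hypothetical stand-ins for the LLM (generate_code) and the py-solc-x
# compile check (check_compilation). Not the paper's actual code.

def generate_code(requirements, fsm, feedback=None):
    # Stub for the fine-tuned LLM: returns a broken first draft,
    # and a repaired contract once it receives compiler feedback.
    if feedback:
        return "contract Fixed {}"
    return "contract Draft {"  # deliberately unbalanced brace


def check_compilation(code):
    # Stand-in for py-solc-x: flag unbalanced braces as a compile error.
    if code.count("{") != code.count("}"):
        return "ParserError: missing closing brace"
    return None  # no error


def fsm_scg(requirements, fsm, feedback_rounds=1):
    # The paper sets the feedback count to 1 (largest gain is from 0 to 1).
    code = generate_code(requirements, fsm)
    for _ in range(feedback_rounds):
        error = check_compilation(code)
        if error is None:
            break
        code = generate_code(requirements, fsm, feedback=error)
    return code


result = fsm_scg("ERC-20 token", "states: Init -> Active")
print(result)
```

With these stubs, the first draft fails the compile check, the error is fed back once, and the regenerated contract passes, mirroring why the authors found one feedback round sufficient for most simple errors.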