Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Guiding LLM-based Smart Contract Generation with Finite State Machine

Authors: Hao Luo, Yuhao Lin, Xiao Yan, Xintong Hu, Yuxiang Wang, Qiming Zeng, Hao Wang, Jiawei Jiang

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental The experimental results show that FSM-SCG significantly improves the quality of smart contract generation. Compared to the best baseline, FSM-SCG improves the compilation success rate of generated smart contract code by at most 48%, and reduces the average vulnerability risk score by approximately 68%. [...] Experimental results show that FSM-SCG significantly enhances the effectiveness and security of smart contracts. Using LLaMA3.1-8B, the compilation success rate reaches 95.1%, 48% higher than the best baseline. Security risks are greatly reduced, with the vulnerability risk score dropping by 68% on average.
Researcher Affiliation Academia Hao Luo¹, Yuhao Lin¹, Xiao Yan¹, Xintong Hu², Yuxiang Wang¹, Qiming Zeng¹, Hao Wang¹, and Jiawei Jiang¹; ¹School of Computer Science, Wuhan University; ²School of Cyber Science and Engineering, Wuhan University. EMAIL, EMAIL, EMAIL
Pseudocode No We also provide the detailed algorithm of FSM-SCG in Appendix A of [Luo et al., 2025].
Open Source Code Yes This paper's code is available at https://github.com/pluto-ms/FSM-Smart-Contract-Generation.
Open Datasets Yes To tackle these two problems, we construct an open-source fine-tuning dataset that covers various tasks and adopts a dialogic format. We collect smart contract source code from platforms like Etherscan and use GPT-4o to generate the corresponding user requirements and FSM, forming a dataset of 30k items, each containing requirements (R), FSM (F), and code (C).
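Each of the 30k items described above pairs requirements (R), an FSM (F), and code (C) in a dialogic format. A minimal sketch of what one such record and its dialogue wrapping might look like; the field names ("requirements", "fsm", "code") and the exact dialogue layout are assumptions for illustration, not taken from the released dataset:

```python
import json

# Hypothetical layout for one fine-tuning item. The paper only states that
# each item contains R, F, and C; these keys and values are illustrative.
record = {
    "requirements": "A token contract where only the owner can mint.",
    "fsm": {
        "states": ["Deployed", "Minting", "Paused"],
        "transitions": [
            {"from": "Deployed", "to": "Minting",
             "trigger": "mint", "guard": "msg.sender == owner"},
        ],
    },
    "code": "pragma solidity ^0.8.0; contract Token { /* ... */ }",
}

# Dialogic format: requirements as the user turn, FSM and code as the
# assistant turn (one plausible way to serialize R -> (F, C) pairs).
dialogue = [
    {"role": "user", "content": record["requirements"]},
    {"role": "assistant",
     "content": json.dumps({"fsm": record["fsm"], "code": record["code"]})},
]
```
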
Dataset Splits No The paper mentions collecting a dataset of 30k items and sampling 1,000 high-quality requirements for testing, but it does not specify the explicit training/validation/test splits for the fine-tuning dataset in the main text. It states "We present the dataset details in Appendix B of [Luo et al., 2025]" implying the details are outside this paper.
Hardware Specification Yes We fine-tune instruction versions of LLaMA3.1-8B and Qwen2.5-7B models using 8 NVIDIA A6000-48GB GPUs.
Software Dependencies No The paper mentions using "py-solc-x to check for compilation errors I_c and Slither [Feist et al., 2019] to detect vulnerabilities I_s." However, it does not provide version numbers for these software dependencies, which are required for reproducibility.
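The two checks quoted above can be sketched with stubbed checkers standing in for py-solc-x and Slither; the function names, return shapes, and the placeholder heuristics below are assumptions for illustration only:

```python
def check_compilation(source: str) -> bool:
    """Stub for the I_c check. A real implementation would invoke
    py-solc-x (e.g. compile the source and catch compiler errors);
    here a trivial placeholder heuristic stands in."""
    return "contract" in source  # placeholder, not real compilation

def check_security(source: str) -> list:
    """Stub for the I_s check. A real implementation would run Slither's
    detectors on the source; here one hand-coded pattern stands in."""
    findings = []
    if "tx.origin" in source:  # classic authentication anti-pattern
        findings.append("tx-origin-authentication")
    return findings

src = ("contract Wallet { function auth() public view returns (bool) "
       "{ return tx.origin == msg.sender; } }")
i_c = check_compilation(src)   # True: placeholder says it "compiles"
i_s = check_security(src)      # flags the tx.origin usage
```
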
Experiment Setup Yes FPFT runs for 3 epochs using AdamW (lr=5e-5). FSM-SCG includes one compilation and one security feedback round. [...] Effect of Feedback Count. Varying the feedback count from 0 to 5, the largest improvement occurs from 0 to 1, as most errors are simple and need minimal iterations. Hence, feedback count is set to 1 for balance. Effect of Fine-Tuning Epochs. We fine-tune FSM-SCG from 1 to 5 epochs. Performance improves notably up to 3 epochs but plateaus thereafter, as the model converges early and saturates with further training. To balance efficiency and performance, we select 3 epochs.
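The setup quoted above (one compilation feedback round and one security feedback round) can be sketched as a generate-check-refine loop. The loop structure, function names, and stub model below are assumptions consistent with the description, not the paper's exact algorithm (given in Appendix A of [Luo et al., 2025]):

```python
def run_feedback_loop(generate, compile_ok, vulnerabilities, feedback_count=1):
    """Generate code, then apply up to `feedback_count` refinement rounds
    for compilation errors and for security findings (one round each,
    matching the paper's chosen setting of feedback count = 1)."""
    code = generate(None)
    for _ in range(feedback_count):
        if not compile_ok(code):
            code = generate(("compile_error", code))
    for _ in range(feedback_count):
        if vulnerabilities(code):
            code = generate(("security_issue", code))
    return code

# Stub model: repairs whatever feedback it receives (illustrative only).
def fake_generate(feedback):
    if feedback is None:
        return "contract A {"  # initial draft with an unbalanced brace
    kind, _prev = feedback
    if kind == "compile_error":
        return "contract A { uint x = tx.origin == msg.sender ? 1 : 0; }"
    return "contract A { uint x = 1; }"  # security round removes tx.origin

final = run_feedback_loop(
    fake_generate,
    compile_ok=lambda c: c.count("{") == c.count("}"),
    vulnerabilities=lambda c: ["tx-origin"] if "tx.origin" in c else [],
)
```

With feedback count 1, the draft's compile error is repaired in the first round and the introduced tx.origin issue in the second, mirroring the paper's observation that most errors need only a single iteration.
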