Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Instruction Tuning for Secure Code Generation
Authors: Jingxuan He, Mark Vero, Gabriela Krasnopolska, Martin Vechev
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive evaluation of SafeCoder covers two popular datasets for standard instruction tuning (Zheng et al., 2023; evo, 2023) and six state-of-the-art LMs. |
| Researcher Affiliation | Academia | 1Department of Computer Science, ETH Zurich, Switzerland. Correspondence to: Jingxuan He <EMAIL>, Mark Vero <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Combining standard and security instruction tuning. We show only one training epoch for simplicity. Algorithm 2 Extracting a high-quality security dataset. |
| Open Source Code | Yes | To benefit the community, we open source our code and datasets. Given the security-for-free advantage, we strongly encourage practitioners to incorporate SafeCoder into their instruction tuning process. (Footnote 1: SafeCoder is publicly available at: https://github.com/eth-sri/SafeCoder.) |
| Open Datasets | Yes | For coding LMs, we use 33K coding-specific samples from evo (2023), an open-source and decontaminated version of Code Evol-Instruct (Luo et al., 2023). For general-purpose LMs, we assemble 18K high-quality samples from LMSYS-Chat-1M, a dataset of real-world conversations with large LMs (Zheng et al., 2023)... Our data collection in Section 5 yields 465 samples spanning 23 CWEs and 6 mainstream languages. We also incorporate the dataset from the public repository of He & Vechev (2023) (9 CWEs and 2 languages). |
| Dataset Splits | Yes | The combined dataset consists of 1268 samples that cover 25 CWEs across 6 languages. We randomly split the dataset into 90% for training and 10% for validation. |
| Hardware Specification | Yes | For both our exploratory and final experiments, we altogether have 3 H100 (80GB) and 8 A100 (40GB) NVIDIA GPUs available. |
| Software Dependencies | No | The paper mentions using Adam optimizer and LoRA fine-tuning but does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA, which are critical for reproducible descriptions. |
| Experiment Setup | Yes | Generally, we perform instruction tuning for 2 epochs using a learning rate of 2e-5. The only special case is Code Llama-7B, which is a fine-tuned completion model from Llama2-7B. For Code Llama-7B, we increase the number of training epochs to 5, and use a higher learning rate (1e-3) following the original paper (Rozière et al., 2023). Moreover, for all LMs, we use batch size 1, accumulate the gradients over 16 steps, and employ the Adam (Kingma & Ba, 2015) optimizer with a weight decay parameter of 1e-2 and ϵ of 1e-8. We clip the accumulated gradients to have norm 1. For LoRA (Hu et al., 2022) fine-tuning, we use an information bottleneck dimension r=16, α=32, and 0.1 dropout. |
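
The setup quoted in the Dataset Splits and Experiment Setup rows can be collected into a small configuration sketch. This is only an illustration using the Python standard library, not the authors' code: the names `TuningConfig` and `split_train_val`, the field names, the shuffling seed, and the rounding of the validation size are all assumptions made here. Only the numeric values come from the quoted text.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningConfig:
    """Hyperparameter values quoted above; the field names are hypothetical."""
    epochs: int = 2
    learning_rate: float = 2e-5
    batch_size: int = 1
    grad_accum_steps: int = 16
    weight_decay: float = 1e-2   # Adam weight decay
    adam_eps: float = 1e-8       # Adam epsilon
    max_grad_norm: float = 1.0   # accumulated gradients are clipped to norm 1
    lora_r: int = 16             # LoRA bottleneck dimension
    lora_alpha: int = 32
    lora_dropout: float = 0.1

    @property
    def effective_batch_size(self) -> int:
        # One optimizer step covers batch_size * grad_accum_steps samples.
        return self.batch_size * self.grad_accum_steps


def split_train_val(samples, val_fraction=0.1, seed=0):
    """Random 90/10 split; the seed and the rounding rule are assumptions."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# General setting, and the Code Llama-7B special case described above.
DEFAULT = TuningConfig()
CODE_LLAMA_7B = TuningConfig(epochs=5, learning_rate=1e-3)

# Stand-in indices for the 1268-sample combined security dataset.
train, val = split_train_val(range(1268))
```

With batch size 1 and 16 accumulation steps, each optimizer update sees 16 samples, which is what `effective_batch_size` computes. The quoted text does not say how the split was seeded, so an exact reproduction of the authors' partition would need their released code.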