Instruction Tuning for Secure Code Generation

Authors: Jingxuan He, Mark Vero, Gabriela Krasnopolska, Martin Vechev

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive evaluation of SafeCoder covers two popular datasets for standard instruction tuning (Zheng et al., 2023; evo, 2023) and six state-of-the-art LMs.
Researcher Affiliation | Academia | Department of Computer Science, ETH Zurich, Switzerland. Correspondence to: Jingxuan He <jingxuan.he@inf.ethz.ch>, Mark Vero <mark.vero@inf.ethz.ch>.
Pseudocode | Yes | Algorithm 1: Combining standard and security instruction tuning (only one training epoch is shown for simplicity). Algorithm 2: Extracting a high-quality security dataset. (A minimal sketch of Algorithm 1 follows the table.)
Open Source Code | Yes | To benefit the community, we open source our code and datasets. Given the security-for-free advantage, we strongly encourage practitioners to incorporate SafeCoder into their instruction tuning process. SafeCoder is publicly available at: https://github.com/eth-sri/SafeCoder.
Open Datasets | Yes | For coding LMs, we use 33K coding-specific samples from evo (2023), an open-source and decontaminated version of Code Evol-Instruct (Luo et al., 2023). For general-purpose LMs, we assemble 18K high-quality samples from LMSYS-Chat-1M, a dataset of real-world conversations with large LMs (Zheng et al., 2023)... Our data collection in Section 5 yields 465 samples spanning 23 CWEs and 6 mainstream languages. We also incorporate the dataset from the public repository of He & Vechev (2023) (9 CWEs and 2 languages).
Dataset Splits | Yes | The combined dataset consists of 1268 samples that cover 25 CWEs across 6 languages. We randomly split the dataset into 90% for training and 10% for validation. (A sketch that merges and splits the data follows the table.)
Hardware Specification | Yes | For both our exploratory and final experiments, we altogether have 3 H100 (80GB) and 8 A100 (40GB) NVIDIA GPUs available.
Software Dependencies | No | The paper mentions using the Adam optimizer and LoRA fine-tuning but does not specify version numbers for key software dependencies such as Python, PyTorch, or CUDA, which are critical for reproducibility.
Experiment Setup | Yes | Generally, we perform instruction tuning for 2 epochs using a learning rate of 2e-5. The only special case is CodeLlama-7B, a completion model fine-tuned from Llama2-7B; for it, we increase the number of training epochs to 5 and use a higher learning rate (1e-3), following the original paper (Rozière et al., 2023). Moreover, for all LMs, we use batch size 1, accumulate the gradients over 16 steps, and employ the Adam (Kingma & Ba, 2015) optimizer with a weight decay parameter of 1e-2 and ϵ of 1e-8. We clip the accumulated gradients to have norm 1. For LoRA (Hu et al., 2022) fine-tuning, we use an information bottleneck dimension r=16, α=32, and 0.1 dropout. (A configuration sketch using these hyperparameters follows the table.)
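
To make the Pseudocode row more concrete, below is a minimal Python sketch of what Algorithm 1 describes: one training epoch over a mix of standard instruction-tuning batches and security-tuning batches. The batch layout, the mask handling, and the exact form of the security objective (a masked likelihood term on the secure program plus an unlikelihood term on the vulnerable one) are assumptions for illustration only, not the authors' reference implementation, which is available in the SafeCoder repository. A Hugging Face-style causal LM interface is assumed.

```python
import random
import torch

def token_logprobs(model, ids):
    # Per-token log-probabilities of `ids` under the model (teacher forcing).
    logits = model(ids).logits[:, :-1, :]
    return torch.log_softmax(logits, dim=-1).gather(-1, ids[:, 1:, None]).squeeze(-1)

def one_epoch(model, optimizer, standard_batches, security_batches):
    # One epoch over a shuffled mix of standard and security batches.
    batches = list(standard_batches) + list(security_batches)
    random.shuffle(batches)
    for batch in batches:
        if batch["kind"] == "standard":
            # Standard instruction tuning: next-token cross-entropy on the response.
            loss = model(batch["input_ids"], labels=batch["labels"]).loss
        else:
            # Security tuning (assumed form): reward the secure program and penalize
            # the vulnerable one, restricted by masks over the tokens that differ
            # between the two versions. Masks align with the shifted targets ids[:, 1:].
            lp_sec = token_logprobs(model, batch["secure_ids"])
            lp_vul = token_logprobs(model, batch["vulnerable_ids"])
            m_sec, m_vul = batch["secure_mask"], batch["vulnerable_mask"]
            loss_sec = -(m_sec * lp_sec).sum() / m_sec.sum()
            # Unlikelihood: maximize log(1 - p) for the masked vulnerable tokens.
            p_vul = lp_vul.exp().clamp(max=1.0 - 1e-6)
            loss_vul = -(m_vul * torch.log1p(-p_vul)).sum() / m_vul.sum()
            loss = loss_sec + loss_vul
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```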
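
The Open Datasets and Dataset Splits rows amount to a merge-and-split step. The file names and JSON-lines schema below are placeholders (the released data lives in the SafeCoder repository); only the sample counts and the 90%/10% split ratio come from the paper, and the random seed is an assumption.

```python
import json
import random

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

# Placeholder file names: the newly collected samples (465 samples, 23 CWEs,
# 6 languages) and the He & Vechev (2023) samples (9 CWEs, 2 languages).
security_data = load_jsonl("collected_samples.jsonl") + load_jsonl("he_vechev_2023.jsonl")

# Random 90%/10% train/validation split of the combined 1268 samples.
random.seed(0)  # seed is an assumption; the paper does not report one
random.shuffle(security_data)
cut = int(0.9 * len(security_data))
train_set, val_set = security_data[:cut], security_data[cut:]
```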
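
Finally, the hyperparameters in the Experiment Setup row map onto a training loop roughly like the following. The sketch assumes a Hugging Face-style model and the peft library for LoRA; `base_model` and `loader` are placeholders, and nothing here should be read as the authors' actual code.

```python
import torch
from torch.optim import Adam
from peft import LoraConfig, get_peft_model

# LoRA configuration as reported: r=16, alpha=32, dropout 0.1.
model = get_peft_model(base_model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.1))

# Adam with weight decay 1e-2 and eps 1e-8; lr 2e-5 (1e-3 and 5 epochs for CodeLlama-7B).
optimizer = Adam(model.parameters(), lr=2e-5, eps=1e-8, weight_decay=1e-2)
EPOCHS, ACCUM_STEPS, MAX_GRAD_NORM = 2, 16, 1.0

for epoch in range(EPOCHS):
    for step, batch in enumerate(loader):            # batch size 1
        loss = model(**batch).loss / ACCUM_STEPS     # accumulate gradients over 16 steps
        loss.backward()
        if (step + 1) % ACCUM_STEPS == 0:
            # Clip the accumulated gradients to norm 1 before the update.
            torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
            optimizer.step()
            optimizer.zero_grad()
```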