Instruction Tuning for Secure Code Generation
Authors: Jingxuan He, Mark Vero, Gabriela Krasnopolska, Martin Vechev
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive evaluation of SafeCoder covers two popular datasets for standard instruction tuning (Zheng et al., 2023; evo, 2023) and six state-of-the-art LMs. |
| Researcher Affiliation | Academia | Department of Computer Science, ETH Zurich, Switzerland. Correspondence to: Jingxuan He <jingxuan.he@inf.ethz.ch>, Mark Vero <mark.vero@inf.ethz.ch>. |
| Pseudocode | Yes | Algorithm 1 Combining standard and security instruction tuning. We show only one training epoch for simplicity. Algorithm 2 Extracting a high-quality security dataset. |
| Open Source Code | Yes | To benefit the community, we open source our code and datasets. Given the security-for-free advantage, we strongly encourage practitioners to incorporate SafeCoder into their instruction tuning process. SafeCoder is publicly available at: https://github.com/eth-sri/SafeCoder. |
| Open Datasets | Yes | For coding LMs, we use 33K coding-specific samples from evo (2023), an open-source and decontaminated version of Code Evol-Instruct (Luo et al., 2023). For general-purpose LMs, we assemble 18K high-quality samples from LMSYS-Chat-1M, a dataset of real-world conversations with large LMs (Zheng et al., 2023)... Our data collection in Section 5 yields 465 samples spanning 23 CWEs and 6 mainstream languages. We also incorporate the dataset from the public repository of He & Vechev (2023) (9 CWEs and 2 languages). |
| Dataset Splits | Yes | The combined dataset consists of 1268 samples that cover 25 CWEs across 6 languages. We randomly split the dataset into 90% for training and 10% for validation. |
| Hardware Specification | Yes | For both our exploratory and final experiments, we altogether have 3 H100 (80GB) and 8 A100 (40GB) NVIDIA GPUs available. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and LoRA fine-tuning but does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA, which are critical for reproducibility. |
| Experiment Setup | Yes | Generally, we perform instruction tuning for 2 epochs using a learning rate of 2e-5. The only special case is Code Llama-7B, which is a fine-tuned completion model from Llama2-7B. For Code Llama-7B, we increase the number of training epochs to 5, and use a higher learning rate (1e-3) following the original paper (Rozière et al., 2023). Moreover, for all LMs, we use batch size 1, accumulate the gradients over 16 steps, and employ the Adam (Kingma & Ba, 2015) optimizer with a weight decay parameter of 1e-2 and ϵ of 1e-8. We clip the accumulated gradients to have norm 1. For LoRA (Hu et al., 2022) fine-tuning, we use an information bottleneck dimension r=16, α=32, and 0.1 dropout. |
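
The hyperparameters quoted in the Experiment Setup row map directly onto a standard fine-tuning configuration. The sketch below is a minimal illustration assuming the Hugging Face Transformers and PEFT libraries; it is not the authors' released training script (that lives in the SafeCoder repository), and the output path is a placeholder.

```python
# Minimal sketch of the reported fine-tuning configuration, assuming
# Hugging Face Transformers + PEFT (not the authors' released code).
from transformers import TrainingArguments
from peft import LoraConfig

# General setting: 2 epochs at lr 2e-5; the paper's exception is
# Code Llama-7B, which uses 5 epochs and lr 1e-3.
training_args = TrainingArguments(
    output_dir="safecoder-sketch",    # placeholder output path
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=1,    # batch size 1 ...
    gradient_accumulation_steps=16,   # ... with gradients accumulated over 16 steps
    weight_decay=1e-2,                # Adam weight decay
    adam_epsilon=1e-8,                # Adam epsilon
    max_grad_norm=1.0,                # clip accumulated gradients to norm 1
)

# LoRA configuration as reported: r=16, alpha=32, dropout 0.1.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
```

In a typical setup, these two objects would be passed, together with the instruction-tuning dataset and base model, to a Trainer-style training loop; per the paper, the same schedule applies to all evaluated LMs except Code Llama-7B.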