
Instruction Tuning for Secure Code Generation

Authors: Jingxuan He, Mark Vero, Gabriela Krasnopolska, Martin Vechev

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our extensive evaluation of SafeCoder covers two popular datasets for standard instruction tuning (Zheng et al., 2023; evo, 2023) and six state-of-the-art LMs.
Researcher Affiliation	Academia	Department of Computer Science, ETH Zurich, Switzerland. Correspondence to: Jingxuan He <jingxuan.he@inf.ethz.ch>, Mark Vero <mark.vero@inf.ethz.ch>.
Pseudocode	Yes	Algorithm 1: Combining standard and security instruction tuning. We show only one training epoch for simplicity. Algorithm 2: Extracting a high-quality security dataset.
Open Source Code	Yes	To benefit the community, we open source our code and datasets. Given the security-for-free advantage, we strongly encourage practitioners to incorporate SafeCoder into their instruction tuning process. SafeCoder is publicly available at: https://github.com/eth-sri/SafeCoder.
Open Datasets	Yes	For coding LMs, we use 33K coding-specific samples from evo (2023), an open-source and decontaminated version of Code Evol-Instruct (Luo et al., 2023). For general-purpose LMs, we assemble 18K high-quality samples from LMSYS-Chat-1M, a dataset of real-world conversations with large LMs (Zheng et al., 2023)... Our data collection in Section 5 yields 465 samples spanning 23 CWEs and 6 mainstream languages. We also incorporate the dataset from the public repository of He & Vechev (2023) (9 CWEs and 2 languages).
Dataset Splits	Yes	The combined dataset consists of 1268 samples that cover 25 CWEs across 6 languages. We randomly split the dataset into 90% for training and 10% for validation.
Hardware Specification	Yes	For both our exploratory and final experiments, we altogether have 3 H100 (80GB) and 8 A100 (40GB) NVIDIA GPUs available.
Software Dependencies	No	The paper mentions the Adam optimizer and LoRA fine-tuning but does not specify version numbers for software dependencies such as Python, PyTorch, or CUDA, which are needed to reproduce the setup.
Experiment Setup	Yes	Generally, we perform instruction tuning for 2 epochs using a learning rate of 2e-5. The only special case is Code Llama-7B, which is a fine-tuned completion model from Llama2-7B. For Code Llama-7B, we increase the number of training epochs to 5, and use a higher learning rate (1e-3) following the original paper (Rozière et al., 2023). Moreover, for all LMs, we use batch size 1, accumulate the gradients over 16 steps, and employ the Adam (Kingma & Ba, 2015) optimizer with a weight decay parameter of 1e-2 and ϵ of 1e-8. We clip the accumulated gradients to have norm 1. For LoRA (Hu et al., 2022) fine-tuning, we use an information bottleneck dimension r=16, α=32, and 0.1 dropout.
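
The Dataset Splits row describes a random 90%/10% split of 1268 samples. A minimal sketch of such a split is shown below. The function name `split_train_val`, the seed, and the choice to round the training size down are all assumptions made for illustration; the quoted text does not say how the authors seeded or rounded the split.

```python
import random


def split_train_val(samples, train_fraction=0.9, seed=0):
    """Randomly split samples into training and validation parts.

    The seed and the floor rounding are assumptions, not taken from the paper.
    """
    shuffled = list(samples)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatability
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 1268 security samples mentioned in the table.
samples = list(range(1268))
train, val = split_train_val(samples)
print(len(train), len(val))  # 1141 127
```

Under these assumptions the split yields 1141 training and 127 validation samples; a different rounding rule would shift one sample between the two parts.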
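
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The sketch below is only one way to organize them: the class `TuningConfig`, its field names, and the helper `config_for` are hypothetical and do not come from the SafeCoder repository. The values are the ones quoted in the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TuningConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    epochs: int = 2
    learning_rate: float = 2e-5
    batch_size: int = 1
    grad_accum_steps: int = 16
    weight_decay: float = 1e-2
    adam_epsilon: float = 1e-8
    max_grad_norm: float = 1.0   # accumulated gradients are clipped to norm 1
    # LoRA settings; relevant only where LoRA fine-tuning is used.
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.1

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step.
        return self.batch_size * self.grad_accum_steps


def config_for(model_name: str) -> TuningConfig:
    """Return the default config, with the quoted Code Llama-7B exception."""
    base = TuningConfig()
    if model_name == "Code Llama-7B":
        return replace(base, epochs=5, learning_rate=1e-3)
    return base


default_cfg = config_for("any-other-model")   # placeholder model name
codellama_cfg = config_for("Code Llama-7B")
print(default_cfg.effective_batch_size)        # 16
print(codellama_cfg.epochs, codellama_cfg.learning_rate)
```

With batch size 1 and 16 accumulation steps, each optimizer update sees 16 samples, which is what `effective_batch_size` computes.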