Exploiting LLM Quantization
Authors: Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, Martin Vechev
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally demonstrate the feasibility and severity of such an attack across three diverse scenarios: vulnerable code generation, content injection, and over-refusal attack. |
| Researcher Affiliation | Academia | Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, Martin Vechev; Department of Computer Science, ETH Zurich; kegashira@ethz.ch, {mark.vero,robin.staab,jingxuan.he,martin.vechev}@inf.ethz.ch |
| Pseudocode | No | No structured pseudocode or algorithm blocks explicitly labeled as such were found. |
| Open Source Code | Yes | Code available at: https://github.com/eth-sri/llm-quantization-attack |
| Open Datasets | Yes | For D_instr, we used the Code-Alpaca dataset. For D_vul and D_sec, we used a subset of the dataset introduced in [15], focusing on 4 Python vulnerabilities. Following He and Vechev [15], we run the static-analyzer-based evaluation method on the test cases that correspond to the tuned vulnerabilities, and we report the percentage of code completions without security vulnerabilities as Code Security. We test this attack scenario on the code-specific models StarCoder 1, 3 & 7 billion [5], and on the general model Phi-2 [34]. To achieve this, we leverage the poisoned instruction tuning dataset introduced in [17], containing instruction-response pairs from the GPT-4-LLM dataset [44], of which 5.2k are modified to contain refusals to otherwise harmless questions. We evaluate this on 1.5k instructions from the databricks-dolly-15k dataset [20]. |
| Dataset Splits | No | The paper uses standard benchmarks (MMLU, TruthfulQA, HumanEval, MBPP) which have predefined evaluation setups. It also uses datasets such as Code-Alpaca, the dataset introduced in [15], GPT-4-LLM, and databricks-dolly-15k for training and evaluation. However, it does not explicitly state the specific train/validation/test splits applied to these datasets for their own experimental procedures (e.g., for fine-tuning or removal phases). |
| Hardware Specification | Yes | All experiments on the paper were conducted on either an H100 (80GB) or an 8x A100 (40GB) compute node. The H100 node has 200GB of RAM and 26 CPU cores; the 8x A100 (40GB) node has 2TB of RAM and 126 CPU cores. |
| Software Dependencies | No | The paper mentions software like Adam [48] (optimizer), Hugging Face Transformers [7], GPT-4 [2] judge, and GitHub CodeQL [49], but does not provide specific version numbers for these software components or other libraries that would be necessary for full reproducibility. |
| Experiment Setup | Yes (see the configuration sketch below the table) | We perform instruction tuning for 1 epoch for injection and 2 epochs for removal with PGD, using a learning rate of 2e-5 for both. We use a batch size of 1, accumulate gradients over 16 steps, and employ the Adam [48] optimizer with a weight decay parameter of 1e-2 and ϵ of 1e-8. We clip the accumulated gradients to have norm 1. |
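
For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a Hugging Face `TrainingArguments` object. This is a minimal sketch, not the authors' released code: the output directory is a placeholder, data loading and the model are omitted, and the paper's PGD projection onto the quantization-preserving weight region is not reproduced here.

```python
# Minimal sketch of the reported fine-tuning hyperparameters using
# Hugging Face Transformers (the paper does not specify versions).
# The output path is a hypothetical placeholder; dataset loading, the
# model, and the paper's PGD projection step are intentionally omitted.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./quantization-attack-injection",  # hypothetical path
    num_train_epochs=1,              # 1 epoch for injection (2 for removal with PGD)
    learning_rate=2e-5,
    per_device_train_batch_size=1,   # batch size of 1 ...
    gradient_accumulation_steps=16,  # ... with gradients accumulated over 16 steps
    weight_decay=1e-2,               # Adam weight decay
    adam_epsilon=1e-8,               # Adam epsilon
    max_grad_norm=1.0,               # clip accumulated gradients to norm 1
)
```

These arguments could then be passed to `transformers.Trainer` together with a model and a tokenized instruction-tuning dataset; note that the removal phase described in the paper additionally constrains updates to stay within the quantization-preserving region, which the stock Trainer does not implement.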