Uncovering Safety Risks of Large Language Models through Concept Activation Vector

Authors: Zhihao Xu, Ruixuan Huang, Changyu Chen, Xiting Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Both automatic and human evaluations demonstrate that our attack method significantly improves the attack success rate and response quality while requiring less training data.
Researcher Affiliation | Academia | Zhihao Xu (1), Ruixuan Huang (2), Changyu Chen (1), Xiting Wang (1). (1) Renmin University of China; (2) The Hong Kong University of Science and Technology
Pseudocode | Yes | Algorithm 1 Attacking multiple layers of an LLM
Open Source Code | Yes | The code is available at https://github.com/SproutNan/AI-Safety_SCAV.
Open Datasets | Yes | The training data for embedding-level attacks are 140 malicious instructions from Advbench [33] and HarmfulQA [34] and 140 safe instructions generated by utilizing GPT-4.
Dataset Splits | No | The paper specifies 'training data' and 'testing datasets' but does not explicitly mention a 'validation' dataset or its split.
Hardware Specification | Yes | For all attacks other than APIs, that is, attacks on locally deployed models, we set max_new_tokens = 1500, and the corresponding experiments are run on 8 NVIDIA 32G V100 GPUs.
Software Dependencies | No | The paper mentions software components like 'sklearn.linear_model.LogisticRegression' and refers to models like 'GPT-4' and 'LLMs' but does not specify versions for core software dependencies used in implementation.
Experiment Setup | Yes | When training SCAV classifiers, we use the default settings provided in the sklearn library. Specifically we simply call sklearn.linear_model.LogisticRegression, which uses a cross-entropy loss with regularization: $\arg\min_{w,b} -\frac{1}{\lvert D \rvert} \sum_{(y,e) \in D} \left[ y \log P_m(e) + (1-y) \log\left(1 - P_m(e)\right) \right] + \lambda_1 \lVert w \rVert^2 + \lambda_2 b^2$ (13), where D is the training dataset, y = 1 if the input instruction is considered malicious and is 0 if the instruction is safe. By default, the regularization coefficient is set to λ1 = λ2 = 0.5. For SCAV, we set P0 = 0.01%, P1 = 90%.
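The classifier setup described in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The embeddings are synthetic stand-ins drawn from Gaussians (the real inputs would be LLM hidden states), and the embedding dimension, the class shift, and the helper name p_malicious are assumptions made only for this example. The 140/140 class sizes and the default-settings call to sklearn.linear_model.LogisticRegression follow the table.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 64            # hypothetical embedding size, chosen only for this sketch
n_per_class = 140   # 140 malicious + 140 safe instructions, as in the table

# Synthetic stand-ins for layer embeddings: the "malicious" class is shifted
# along one random unit direction (an assumption, not the paper's data).
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
safe = rng.normal(size=(n_per_class, dim))
malicious = rng.normal(size=(n_per_class, dim)) + 4.0 * direction

X = np.vstack([malicious, safe])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])  # 1 = malicious

# Default settings: L2 penalty with C=1.0.
clf = LogisticRegression()
clf.fit(X, y)

w = clf.coef_[0]       # weight vector of the linear classifier
b = clf.intercept_[0]  # bias term

def p_malicious(e):
    """Probability that embedding(s) e are classified as malicious: sigmoid(w.e + b)."""
    return 1.0 / (1.0 + np.exp(-(e @ w + b)))

train_accuracy = clf.score(X, y)
```

Here p_malicious plays the role of P_m(e) in Eq. (13), and for a fitted binary model it agrees with clf.predict_proba(X)[:, 1]. Real hidden states have far more dimensions than training examples in this setting, so training accuracy alone says little; a held-out test set is the more informative check.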