Uncovering Safety Risks of Large Language Models through Concept Activation Vector
Authors: Zhihao Xu, Ruixuan Huang, Changyu Chen, Xiting Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both automatic and human evaluations demonstrate that our attack method significantly improves the attack success rate and response quality while requiring less training data. |
| Researcher Affiliation | Academia | Zhihao Xu1 , Ruixuan Huang2 , Changyu Chen1, Xiting Wang1 1Renmin University of China 2The Hong Kong University of Science and Technology |
| Pseudocode | Yes | Algorithm 1 Attacking multiple layers of an LLM |
| Open Source Code | Yes | The code is available at https://github.com/SproutNan/AI-Safety_SCAV. |
| Open Datasets | Yes | The training data for embedding-level attacks are 140 malicious instructions from AdvBench [33] and HarmfulQA [34] and 140 safe instructions generated by utilizing GPT-4. |
| Dataset Splits | No | The paper specifies 'training data' and 'testing datasets' but does not explicitly mention a 'validation' dataset or its split. |
| Hardware Specification | Yes | For all attacks other than APIs, that is, attacks on locally deployed models, we set max_new_tokens = 1500, and the corresponding experiments are run on 8 NVIDIA 32G V100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'sklearn.linear_model.LogisticRegression' and refers to models like 'GPT-4' and 'LLMs' but does not specify versions for core software dependencies used in implementation. |
| Experiment Setup | Yes | When training SCAV classifiers, we use the default settings provided in the sklearn library. Specifically, we call sklearn.linear_model.LogisticRegression, which uses a cross-entropy loss with regularization: $\arg\min_{w,b} -\frac{1}{\|D\|}\sum_{(y,e)\in D}\left[y\log P_m(e) + (1-y)\log(1-P_m(e))\right] + \lambda_1\|w\|^2 + \lambda_2 b^2$ (13), where $D$ is the training dataset, and $y = 1$ if the input instruction is considered malicious and $y = 0$ if the instruction is safe. By default, the regularization coefficient is set to $\lambda_1 = \lambda_2 = 0.5$. For SCAV, we set $P_0 = 0.01\%$, $P_1 = 90\%$. |
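The quoted setup trains the classifier with sklearn defaults on Eq. (13). A minimal sketch of that step is below; the random embeddings are hypothetical stand-ins (the paper's SCAV classifiers are trained on LLM hidden-state embeddings of the 140 malicious and 140 safe instructions), and note that sklearn expresses the L2 strength via `C` (inverse of the regularization weight) rather than λ directly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: rows play the role of layer embeddings e,
# labels y follow the paper's convention (1 = malicious, 0 = safe).
# Random vectors are used only to make the sketch self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(280, 64))        # 140 malicious + 140 safe "embeddings"
y = np.array([1] * 140 + [0] * 140)

# Default sklearn settings, as the review quotes: L2-regularized
# cross-entropy, matching the form of Eq. (13).
clf = LogisticRegression().fit(X, y)

# P_m(e): predicted probability that an embedding is malicious.
p_malicious = clf.predict_proba(X)[:, 1]
print(p_malicious.shape)              # one probability per instruction
```

One probability per input is produced; the SCAV thresholds quoted above (P0 = 0.01%, P1 = 90%) would then be applied to these predicted probabilities.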