reproducibilityindex.ai

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

Authors: Zhihao Xu, Ruixuan HUANG, Changyu Chen, Xiting Wang

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Both automatic and human evaluations demonstrate that our attack method significantly improves the attack success rate and response quality while requiring less training data.
Researcher Affiliation	Academia	Zhihao Xu¹, Ruixuan Huang², Changyu Chen¹, Xiting Wang¹; ¹Renmin University of China, ²The Hong Kong University of Science and Technology
Pseudocode	Yes	Algorithm 1 Attacking multiple layers of an LLM
Open Source Code	Yes	The code is available at https://github.com/SproutNan/AI-Safety_SCAV.
Open Datasets	Yes	The training data for embedding-level attacks are 140 malicious instructions from Advbench [33] and HarmfulQA [34] and 140 safe instructions generated by utilizing GPT-4.
Dataset Splits	No	The paper specifies 'training data' and 'testing datasets' but does not explicitly mention a 'validation' dataset or its split.
Hardware Specification	Yes	For all attacks other than APIs, that is, attacks on locally deployed models, we set max_new_tokens = 1500, and the corresponding experiments are run on 8 NVIDIA 32G V100 GPUs.
Software Dependencies	No	The paper mentions software components like 'sklearn.linear_model.LogisticRegression' and refers to models like 'GPT-4' and 'LLMs' but does not specify versions for core software dependencies used in implementation.
Experiment Setup	Yes	When training SCAV classifiers, we use the default settings provided in the sklearn library. Specifically, we simply call sklearn.linear_model.LogisticRegression, which uses a cross-entropy loss with regularization: $\arg\min_{w,b} -\frac{1}{|D|} \sum_{(y,e) \in D} \left[ y \log P_m(e) + (1 - y) \log\left(1 - P_m(e)\right) \right] + \lambda_1 \|w\|^2 + \lambda_2 b^2$ (13), where $D$ is the training dataset, and $y = 1$ if the input instruction is considered malicious and $y = 0$ if the instruction is safe. By default, the regularization coefficient is set to $\lambda_1 = \lambda_2 = 0.5$. For SCAV, we set $P_0$ = 0.01%, $P_1$ = 90%.
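
The classifier training described in the Experiment Setup row can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the embeddings are synthetic Gaussian data standing in for LLM hidden states, and the embedding dimension `dim`, the helper `train_concept_classifier`, and the class separation are all assumptions made for the example. Only the call to `sklearn.linear_model.LogisticRegression` with default settings and the 140 + 140 training-set size come from the text above. How the paper's $\lambda_1$, $\lambda_2$ relate to sklearn's `C` parameter is not stated in the source, so the sketch simply keeps the library defaults.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_concept_classifier(embeddings, labels):
    """Fit a linear probe P_m(e) = sigmoid(w . e + b) using sklearn defaults.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    clf = LogisticRegression()  # default settings, as the paper describes
    clf.fit(embeddings, labels)
    return clf


# Synthetic stand-in for per-layer LLM embeddings (assumption: real
# embeddings would be extracted from the model's hidden states).
rng = np.random.default_rng(0)
dim = 64            # hypothetical embedding dimension
n_per_class = 140   # 140 malicious + 140 safe instructions, per the paper

# Shift the two classes apart along one random unit direction.
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
safe = rng.normal(size=(n_per_class, dim)) - 2.0 * direction
malicious = rng.normal(size=(n_per_class, dim)) + 2.0 * direction

X = np.vstack([safe, malicious])
y = np.concatenate([
    np.zeros(n_per_class, dtype=int),  # y = 0: safe
    np.ones(n_per_class, dtype=int),   # y = 1: malicious
])

clf = train_concept_classifier(X, y)
w = clf.coef_[0]        # learned weight vector
b = clf.intercept_[0]   # learned bias
p_malicious = clf.predict_proba(X)[:, 1]  # P_m(e) for each embedding
train_accuracy = clf.score(X, y)
```

Here `w` and `b` play the role of the parameters in Equation (13), and `p_malicious` is the classifier's estimate of $P_m(e)$ for each training embedding.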