Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Uncovering Safety Risks of Large Language Models through Concept Activation Vector
Authors: Zhihao Xu, Ruixuan HUANG, Changyu Chen, Xiting Wang
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both automatic and human evaluations demonstrate that our attack method significantly improves the attack success rate and response quality while requiring less training data. |
| Researcher Affiliation | Academia | Zhihao Xu¹, Ruixuan Huang², Changyu Chen¹, Xiting Wang¹; ¹Renmin University of China, ²The Hong Kong University of Science and Technology |
| Pseudocode | Yes | Algorithm 1 Attacking multiple layers of an LLM |
| Open Source Code | Yes | The code is available at https://github.com/SproutNan/AI-Safety_SCAV. |
| Open Datasets | Yes | The training data for embedding-level attacks are 140 malicious instructions from Advbench [33] and HarmfulQA [34] and 140 safe instructions generated by utilizing GPT-4. |
| Dataset Splits | No | The paper specifies 'training data' and 'testing datasets' but does not explicitly mention a 'validation' dataset or its split. |
| Hardware Specification | Yes | For all attacks other than APIs, that is, attacks on locally deployed models, we set max_new_tokens = 1500, and the corresponding experiments are run on 8 NVIDIA 32G V100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'sklearn.linear_model.LogisticRegression' and refers to models like 'GPT-4' and 'LLMs' but does not specify versions for core software dependencies used in implementation. |
| Experiment Setup | Yes | When training SCAV classifiers, we use the default settings provided in the sklearn library. Specifically, we simply call sklearn.linear_model.LogisticRegression, which uses a cross-entropy loss with regularization: $\arg\min_{w,b} -\frac{1}{\lvert D \rvert} \sum_{(y,e) \in D} \left[ y \log P_m(e) + (1 - y) \log(1 - P_m(e)) \right] + \lambda_1 \lVert w \rVert^2 + \lambda_2 b^2$ (13), where $D$ is the training dataset, and $y = 1$ if the input instruction is considered malicious and $y = 0$ if the instruction is safe. By default, the regularization coefficients are set to $\lambda_1 = \lambda_2 = 0.5$. For SCAV, we set $P_0 = 0.01\%$, $P_1 = 90\%$. |
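
The Experiment Setup row describes fitting a linear classifier on instruction embeddings by calling scikit-learn's `LogisticRegression` with default settings. A minimal sketch of that step might look like the following. It assumes synthetic Gaussian vectors in place of real LLM embeddings; the embedding dimension, the separating direction, and the helper `p_malicious` are hypothetical and are not taken from the paper's code. Only the class sizes (140 malicious, 140 safe) and the default-constructor call come from the table above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumption: synthetic stand-ins for layer embeddings of instructions.
rng = np.random.default_rng(0)
dim = 64            # hypothetical embedding size, chosen for illustration
n_per_class = 140   # 140 malicious + 140 safe instructions, as in the table

# Hypothetical "malicious" direction used only to make the toy data separable.
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

malicious = rng.normal(size=(n_per_class, dim)) + 2.0 * direction
safe = rng.normal(size=(n_per_class, dim)) - 2.0 * direction

X = np.vstack([malicious, safe])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])  # 1 = malicious

# Default settings, as the quoted setup states.
clf = LogisticRegression()
clf.fit(X, y)


def p_malicious(e):
    """Return the classifier's probability that embedding(s) e are malicious."""
    return clf.predict_proba(np.atleast_2d(e))[:, 1]


train_acc = clf.score(X, y)
probs = p_malicious(X)
```

Here `clf.coef_` and `clf.intercept_` play the roles of $w$ and $b$ in the regularized cross-entropy objective, and `p_malicious` corresponds to $P_m(e)$. How scikit-learn's default regularization strength maps onto the stated $\lambda_1$ and $\lambda_2$ is not verified by this sketch.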