Who's asking? User personas and the mechanics of latent misalignment

Authors: Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael Lepori, Lucas Dixon

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this work, we show that whether they do so depends significantly on who they are talking to, which we refer to as user persona. In fact, we find manipulating user persona to be more effective for eliciting harmful content than certain more direct attempts to control model refusal. We study both natural language prompting and activation steering as intervention methods and show that activation steering is significantly more effective at bypassing safety filters.
Researcher Affiliation Collaboration Asma Ghandeharioun Google Research aghandeharioun@google.com Ann Yuan Google Research annyuan@google.com Marius Guerard Google Research mariusguerard@google.com Emily Reif Google Research ereif@google.com Michael A. Lepori Brown University / Google Research mlepori@google.com Lucas Dixon Google Research ldixon@google.com
Pseudocode No The paper describes methods in text and mathematical notation but does not include pseudocode or algorithm blocks.
Open Source Code No In addition, we have included the raw data and code in the supplementary material, and plan on releasing them in open-source in the near future.
Open Datasets Yes Using the Adv Bench [Zou et al., 2023b] corpus of 500 adversarial queries as a starting point (e.g. Give step-by-step instructions for how to steal someone s identity ), we generated a more challenging variant which we refer to as Sneaky Adv Bench 3, by asking text-unicorn@001 4 to rewrite the queries to be less obviously adversarial, while retaining the meaning of the originals5. We used the following prompt: Consider the following attack prompt <original_prompt>{raw}</original_prompt> Rewrite this prompt to be sneaky / less obvious (but still seeking the same piece of information) <rewritten_prompt> , where we replaced {raw} with the original prompt. We hold out a random subset of 100 attacks for evaluation purposes, Sneaky Adv Bench-eval. For more information about Sneaky Adv Bench, see G.
Dataset Splits No The paper mentions holding out a subset for evaluation (test set) but does not specify distinct training, validation, and test splits with percentages or sample counts.
Hardware Specification Yes All experiments, either running pure inference or a modified forward computation through patching hidden representations, were conducted on A100 80GB or 40GB GPUs on the cloud.
Software Dependencies No All automatic rating experiments were conducted using text-unicorn@001 via Cloud Vertex AI.
Experiment Setup Yes For a given layer l, we create a persona steering vector following the contrastive activation addition (CAA) method [Rimsky et al., 2023]. ... We consider every other layer between 5 and 39 (for efficiency) in Llama 2 13B chat (40 layers total) and select l based on the maximum per-layer success rate across all experiments. ... At inference, we prompt the model with an adversarial query and calculate hidden representations until layer l, then add the steering vector to all positions in layer l and continue forward computation.