Analysing the Generalisation and Reliability of Steering Vectors

Authors: Daniel Tan, David Chanin, Aengus Lynch, Brooks Paige, Dimitrios Kanoulas, Adrià Garriga-Alonso, Robert Kirk

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution." |
| Researcher Affiliation | Collaboration | 1 AI Centre, Department of Computer Science, University College London; 2 FAR AI; 3 Archimedes/Athena RC |
| Pseudocode | No | The paper describes the steps for "Steering Vector Extraction" and "Steering Intervention" using mathematical formulas and descriptive text, but does not provide a formally structured pseudocode or algorithm block. |
| Open Source Code | Yes | Checklist question: "Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?" Answer: [Yes]. Justification: "We provide access to the code and data." |
| Open Datasets | Yes | "We focus on the Model-Written Evaluations (MWE) datasets [26], a large dataset consisting of prompts from over 100 distinct categories designed to evaluate many specific aspects of model behaviour. Each category contains 1000 samples generated by an LLM, covering a variety of personas and behaviors. For each of these datasets, we construct a 40-10-50 train-val-test split. We also include TruthfulQA [17] and the sycophancy dataset [26], as they were used in CAA [30]." |
| Dataset Splits | Yes | "For each of these datasets, we construct a 40-10-50 train-val-test split. The validation split is used for hyperparameter selection; we discuss this in Section 4.3." |
| Hardware Specification | Yes | "All experiments were performed using an A100 with 40 GB of VRAM." |
| Software Dependencies | No | The paper mentions specific models (e.g., Llama-2-7b-Chat, Qwen-1.5-14b-Chat) but does not provide specific version numbers for key software libraries or frameworks (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | "Thus, we fix layer 13 for Llama and layer 21 for Qwen for all subsequent experiments. [...] In our experiments, we fix a range of (−1.5, 1.5) within which we select multipliers to perform contrastive activation addition." |
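The report notes that the paper gives no pseudocode for "Steering Vector Extraction" and "Steering Intervention". As a rough illustration only, here is a minimal NumPy sketch of the standard mean-difference recipe used by contrastive activation addition, with random toy arrays standing in for layer-13 residual-stream activations; all function names and shapes here are hypothetical, not taken from the paper's code.

```python
import numpy as np

def extract_steering_vector(pos_acts, neg_acts):
    """Mean-difference steering vector: average activation over
    positive-behaviour prompts minus the average over negative ones."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(activations, vector, multiplier):
    """Steering intervention: add the scaled steering vector to the
    residual-stream activations at the chosen layer."""
    return activations + multiplier * vector

# Toy stand-ins for activations (16 contrastive prompt pairs, hidden size 8).
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(16, 8))   # "positive" completions
neg = rng.normal(loc=-1.0, size=(16, 8))  # "negative" completions

v = extract_steering_vector(pos, neg)
h = rng.normal(size=(4, 8))               # activations to steer at inference
steered = steer(h, v, multiplier=1.5)     # multiplier drawn from (-1.5, 1.5)
print(steered.shape)
```

With `multiplier = 0` the intervention is a no-op, which is why sweeping multipliers across a symmetric range such as (−1.5, 1.5) lets one push behaviour in either direction.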
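The 40-10-50 train-val-test split the report quotes is straightforward; a minimal sketch of one way to produce it per MWE category (the helper name and seed are illustrative, not from the paper):

```python
import random

def split_40_10_50(samples, seed=0):
    """Shuffle one category's samples and cut them 40% / 10% / 50%
    into train, validation, and test portions."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.4 * n)
    n_val = int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Each MWE category has 1000 samples, so the split is 400 / 100 / 500.
train, val, test = split_40_10_50(list(range(1000)))
print(len(train), len(val), len(test))  # 400 100 500
```

The validation portion is what the paper uses for hyperparameter selection (e.g., choosing the steering layer and multiplier).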