Analysing the Generalisation and Reliability of Steering Vectors
Authors: Daniel Tan, David Chanin, Aengus Lynch, Brooks Paige, Dimitrios Kanoulas, Adrià Garriga-Alonso, Robert Kirk
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution. |
| Researcher Affiliation | Collaboration | 1 AI Centre, Department of Computer Science, University College London 2 FAR AI 3 Archimedes/Athena RC |
| Pseudocode | No | The paper describes the steps for 'Steering Vector Extraction' and 'Steering Intervention' using mathematical formulas and descriptive text, but does not provide a formally structured pseudocode or algorithm block. |
| Open Source Code | Yes | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide access to the code and data. |
| Open Datasets | Yes | We focus on the Model-Written Evaluations (MWE) datasets [26], a large collection consisting of prompts from over 100 distinct categories designed to evaluate many specific aspects of model behaviour. Each category contains 1000 samples generated by an LLM, covering a variety of personas and behaviors. For each of these datasets, we construct a 40-10-50 train-val-test split. We also include TruthfulQA [17] and the sycophancy dataset [26], as they were used in CAA [30]. |
| Dataset Splits | Yes | For each of these datasets, we construct a 40-10-50 train-val-test split. The validation split is used for hyperparameter selection; we discuss this in Section 4.3. |
| Hardware Specification | Yes | All experiments were performed using an A100 with 40 GB of VRAM. |
| Software Dependencies | No | The paper mentions specific models (e.g., Llama-2-7b-Chat, Qwen-1.5-14b-Chat) but does not provide specific version numbers for key software libraries or frameworks (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | Thus, we fix layer 13 for Llama and layer 21 for Qwen for all subsequent experiments. [...] In our experiments, we fix a range of (-1.5, 1.5) within which we select multipliers to perform contrastive activation addition. |
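The extraction and intervention procedures the paper describes in prose (a steering vector taken as the mean difference of activations on contrastive prompt pairs, then added to hidden states with a scalar multiplier chosen from (-1.5, 1.5)) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the function names and array shapes are assumptions.

```python
import numpy as np

def extract_steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means steering vector.

    pos_acts, neg_acts: (n_pairs, hidden_dim) activations at a fixed layer
    (e.g. layer 13 for Llama, layer 21 for Qwen) for the positive and
    negative completions of each contrastive pair.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(hidden: np.ndarray, vector: np.ndarray, multiplier: float) -> np.ndarray:
    """Contrastive activation addition: add the scaled vector to hidden states.

    The multiplier is selected from a fixed range, (-1.5, 1.5) in the paper.
    """
    return hidden + multiplier * vector

# Toy example with synthetic activations (hidden_dim = 4).
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(8, 4))
neg = rng.normal(loc=-1.0, size=(8, 4))
v = extract_steering_vector(pos, neg)
steered = apply_steering(np.zeros(4), v, multiplier=1.0)
```

In practice the activations would be collected with forward hooks on the chosen transformer layer; the sketch only shows the arithmetic of the two steps.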