Analysing the Generalisation and Reliability of Steering Vectors

Authors: Daniel Tan, David Chanin, Aengus Lynch, Brooks Paige, Dimitrios Kanoulas, Adrià Garriga-Alonso, Robert Kirk

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution." |
| Researcher Affiliation | Collaboration | 1 AI Centre, Department of Computer Science, University College London; 2 FAR AI; 3 Archimedes/Athena RC |
| Pseudocode | No | The paper describes the steps for "Steering Vector Extraction" and "Steering Intervention" using mathematical formulas and descriptive text, but does not provide a formally structured pseudocode or algorithm block. |
| Open Source Code | Yes | Checklist question: "Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?" Answer: [Yes]. Justification: "We provide access to the code and data." |
| Open Datasets | Yes | "We focus on the Model-Written Evaluations (MWE) datasets [26], a large dataset consisting of prompts from over 100 distinct categories designed to evaluate many specific aspects of model behaviour. Each category contains 1000 samples generated by an LLM, covering a variety of personas and behaviors. For each of these datasets, we construct a 40-10-50 train-val-test split. We also include TruthfulQA [17] and the sycophancy dataset [26], as they were used in CAA [30]." |
| Dataset Splits | Yes | "For each of these datasets, we construct a 40-10-50 train-val-test split. The validation split is used for hyperparameter selection; we discuss this in Section 4.3." |
| Hardware Specification | Yes | "All experiments were performed using an A100 with 40 GB of VRAM." |
| Software Dependencies | No | The paper mentions specific models (e.g., Llama-2-7b-Chat, Qwen-1.5-14b-Chat) but does not provide specific version numbers for key software libraries or frameworks (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | "Thus, we fix layer 13 for Llama and layer 21 for Qwen for all subsequent experiments. [...] In our experiments, we fix a range of (−1.5, 1.5) within which we select multipliers to perform contrastive activation addition." |
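The report notes that the paper gives no pseudocode for "Steering Vector Extraction" and "Steering Intervention". As a rough illustration only, here is a minimal NumPy sketch of the standard mean-difference recipe used by contrastive activation addition, with random toy arrays standing in for layer-13 residual-stream activations; all function names and shapes here are hypothetical, not taken from the paper's code.

```python
import numpy as np

def extract_steering_vector(pos_acts, neg_acts):
    """Mean-difference steering vector: average activation over
    positive-behaviour prompts minus the average over negative ones."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(activations, vector, multiplier):
    """Steering intervention: add the scaled steering vector to the
    residual-stream activations at the chosen layer."""
    return activations + multiplier * vector

# Toy stand-ins for activations (16 contrastive prompt pairs, hidden size 8).
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(16, 8))   # "positive" completions
neg = rng.normal(loc=-1.0, size=(16, 8))  # "negative" completions

v = extract_steering_vector(pos, neg)
h = rng.normal(size=(4, 8))               # activations to steer at inference
steered = steer(h, v, multiplier=1.5)     # multiplier drawn from (-1.5, 1.5)
print(steered.shape)
```

With `multiplier = 0` the intervention is a no-op, which is why sweeping multipliers across a symmetric range such as (−1.5, 1.5) lets one push behaviour in either direction.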
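The 40-10-50 train-val-test split the report quotes is straightforward; a minimal sketch of one way to produce it per MWE category (the helper name and seed are illustrative, not from the paper):

```python
import random

def split_40_10_50(samples, seed=0):
    """Shuffle one category's samples and cut them 40% / 10% / 50%
    into train, validation, and test portions."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.4 * n)
    n_val = int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Each MWE category has 1000 samples, so the split is 400 / 100 / 500.
train, val, test = split_40_10_50(list(range(1000)))
print(len(train), len(val), len(test))  # 400 100 500
```

The validation portion is what the paper uses for hyperparameter selection (e.g., choosing the steering layer and multiplier).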