Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Authors: Fred Zhang, Neel Nanda
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters could lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. |
| Researcher Affiliation | Collaboration | Fred Zhang (UC Berkeley, z0@berkeley.edu); Neel Nanda (Independent, neelnanda27@gmail.com) |
| Pseudocode | No | The paper describes methods in prose and figures but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: 'All our experiments are performed using the original codebase of Wang et al. (2023), available at https://github.com/redwoodresearch/Easy-Transformer.' This refers to a third-party codebase used, not source code for the specific comparative methodology developed or described in this paper. |
| Open Datasets | Yes | Dataset and corruption method: STR requires pairs of Xclean and Xcorrupt that are semantically similar. To perform STR, we construct PAIREDFACTS of 145 pairs of prompts on factual recall. All the prompts are in-distribution, as they are selected from the original dataset of Meng et al. (2022); see Appendix C for details. ... The full dataset is available at https://www.jsonkeeper.com/b/P1GL. |
| Dataset Splits | No | The paper describes the datasets used and the experimental procedure but does not provide specific details on how the datasets are split into training, validation, and test sets for reproducibility. |
| Hardware Specification | No | The paper mentions the language models used (e.g., GPT-2 XL, GPT-J 6B) but does not provide specific details about the hardware (CPU, GPU, memory, etc.) on which the experiments were conducted. |
| Software Dependencies | No | The paper mentions the use of 'Transformer Lens library (Nanda & Bloom, 2022)' and the 'original codebase of Wang et al. (2023)' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Experiment setup: For factual recall, we investigate Meng et al. (2022)'s hypothesis that model computation is concentrated at early-middle MLP layers (by processing the last subject token). Specifically, we corrupt the subject token(s) to generate Xcorrupt. In the patched run, we override the MLP activations at the last subject token. Following Meng et al. (2022); Hase et al. (2023), at each layer we restore a set of 5 adjacent MLP layers. (More results on other window sizes can be found in Section H.1. We examine sliding window patching more closely in Section 5.) For IOI circuit discovery, we follow Wang et al. (2023) and focus on the role of attention heads. Corruption is applied to the S2 token. Then we patch a single attention head's output (at all positions) and iterate over all heads in this way. To avoid relying on visual inspection, we say that a head is detected if its patching effect is 2 standard deviations (SD) away from the mean effect. |
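The head-detection rule quoted in the Experiment Setup row (a head is "detected" if its patching effect lies more than 2 SD from the mean effect across all heads) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name and the example effect values are hypothetical.

```python
import numpy as np

def detect_heads(effects, n_sd=2.0):
    """Return indices of attention heads whose patching effect deviates
    from the mean effect by more than n_sd standard deviations.

    This mirrors the paper's stated criterion; the helper itself is
    an assumption, not the authors' implementation.
    """
    effects = np.asarray(effects, dtype=float)
    mean, sd = effects.mean(), effects.std()
    return [i for i, e in enumerate(effects) if abs(e - mean) > n_sd * sd]

# Hypothetical example: one head with an outsized patching effect
# among otherwise near-zero effects.
effects = [0.01, -0.02, 0.00, 0.03, 0.95, -0.01, 0.02, 0.00]
print(detect_heads(effects))  # → [4]
```

In practice the effects would come from iterating patching over every attention head (e.g. via TransformerLens hooks), so the threshold avoids relying on visual inspection of the resulting heatmap.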