Towards Automated Circuit Discovery for Mechanistic Interpretability
Authors: Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process steps: finding the connections between the abstract neural network units that form a circuit. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work. Our code is available at https://github.com/ArthurConmy/Automatic-Circuit-Discovery. |
| Researcher Affiliation | Collaboration | Arthur Conmy (Independent), Augustine N. Mavor-Parker (UCL), Aengus Lynch (UCL), Stefan Heimersheim (University of Cambridge), Adrià Garriga-Alonso (FAR AI) |
| Pseudocode | Yes | Algorithm 1 describes ACDC. [...] Algorithm 1: The ACDC algorithm. Data: Computational graph G, dataset (x_i)_{i=1}^n, corrupted datapoints (x'_i)_{i=1}^n, and threshold τ > 0. Result: Subgraph H ⊆ G. [...] |
| Open Source Code | Yes | Our code is available at https://github.com/ArthurConmy/Automatic-Circuit-Discovery. |
| Open Datasets | Yes | Our IOI experiments were conducted with a dataset of N = 50 text examples from one template the IOI paper used (...). The corrupted dataset was examples from the ABC dataset (Wang et al., 2023) (...). We use a random sample of 100 datapoints from the dataset provided by Hanna, Liu, and Variengien (2023). (...) We use 40 sequences of 300 tokens from a filtered validation set of OpenWebText (Gokaslan et al., 2019). The model is available in the TransformerLens (Nanda, 2022) library. |
| Dataset Splits | No | Note that we use the term dataset to refer to a collection of prompts that elicit some behavior in a model: we do not train models on these examples, as in this paper we focus on post-hoc interpretability. |
| Hardware Specification | Yes | In the IOI experiment in Figure 1, we did not split the computational graph into the query, key and value calculations for each head. This enabled the ACDC run to complete in 8 minutes on an NVIDIA A100 GPU. |
| Software Dependencies | Yes | The model can be loaded with transformer_lens.HookedTransformer.from_pretrained(model_name="redwood_attn_2l", center_writing_weights=False, center_unembed=False) (at least for the repository version of the source code as of 23rd May 2023) |
| Experiment Setup | Yes | The order in which we iterate over the parents w of v is a hyperparameter. (...) The regularization coefficients λ (in the notation of Cao, Sanh, and Rush (2021)) we used in Figure 4 were 0.01, 0.0158, 0.0251, 0.0398, 0.0631, 0.1, 0.158, 0.251, 0.398, 0.631, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 50, 70, 90, 110, 130, 150, 170, 190, 210, 230, 250. |
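The Algorithm 1 excerpt above describes ACDC's greedy edge-pruning loop: starting from the full computational graph, iterate over nodes in reverse topological order and permanently remove each incoming edge whose removal changes the chosen metric by less than the threshold τ. A minimal toy sketch of that loop, with an abstract `metric` callable standing in for activation patching on a real model (the function and variable names here are illustrative, not the paper's implementation):

```python
def acdc(nodes, parents, metric, tau):
    """Toy sketch of ACDC's greedy pruning (Algorithm 1).

    nodes:   node names in reverse topological order (outputs first)
    parents: dict mapping each node v to the set of its parents in G
    metric:  callable on the current subgraph H (a parents dict),
             e.g. KL divergence to the full model's outputs; lower is better
    tau:     threshold below which an edge is deemed unimportant
    Returns the pruned subgraph H as a parents dict.
    """
    # H starts as a copy of the full graph G.
    H = {v: set(ws) for v, ws in parents.items()}
    for v in nodes:
        # The iteration order over parents w of v is a hyperparameter
        # (as noted in the Experiment Setup row above).
        for w in sorted(H[v]):
            base = metric(H)        # metric with edge w -> v present
            H[v].remove(w)          # tentatively remove the edge
            if metric(H) - base < tau:
                continue            # removal barely hurts: prune permanently
            H[v].add(w)             # removal hurt too much: restore the edge
    return H
```

A usage example on a toy graph, where the metric penalizes the absence of two "important" edges so that ACDC keeps exactly those and discards the rest:

```python
full = {"out": {"a", "b", "c"}, "a": set(), "b": set(), "c": set()}
important = {("a", "out"), ("b", "out")}

def metric(H):
    # 1.0 penalty per important edge missing from H; 0 otherwise.
    return sum(1.0 for v in full for w in full[v] - H[v] if (w, v) in important)

H = acdc(["out", "a", "b", "c"], full, metric, tau=0.5)
# H["out"] is {"a", "b"}: the unimportant edge c -> out was pruned.
```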