LLM Circuit Analyses Are Consistent Across Training and Scale

Authors: Curt Tigges, Michael Hanna, Qinan Yu, Stella Biderman

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we track how model mechanisms, operationalized as circuits, emerge and evolve across 300 billion tokens of training in decoder-only LLMs, in models ranging from 70 million to 2.8 billion parameters. We find that task abilities and the functional components that support them emerge consistently at similar token counts across scale. Moreover, although such components may be implemented by different attention heads over time, the overarching algorithm that they implement remains. Surprisingly, both these algorithms and the types of components involved therein tend to replicate across model scale. Finally, we find that circuit size correlates with model size and can fluctuate considerably over time even when the same algorithm is implemented. These results suggest that circuit analyses conducted on small models at the end of pre-training can provide insights that still apply after additional training and over model scale.
Researcher Affiliation | Collaboration | Curt Tigges, EleutherAI (curt@eleuther.ai); Michael Hanna, ILLC, University of Amsterdam (m.w.hanna@uva.nl); Qinan Yu, Brown University (qinan_yu@brown.edu); Stella Biderman, EleutherAI (stella@eleuther.ai)
Pseudocode | No | The paper describes methods and procedures in paragraph text, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | We have included the full code used to run the experiments, along with a readme file, zipped into a file as supplemental material.
Open Datasets | Yes | We study models from the Pythia suite [5] across 300 billion tokens, at scales from 70 million to 12 billion parameters. We use a small dataset of 70 IOI examples created with Wang et al.'s [71] generator, as larger datasets did not provide significantly better results in our experiments and this size fit into GPU memory more easily. We craft 70 examples as in [44]. We create 200 Greater-Than examples with Hanna et al.'s [32] generator. We use 200 synthetic SVA example sentences from [52].
Dataset Splits | No | The paper mentions training and test data, but does not explicitly describe validation data splits or methodology.
Hardware Specification | Yes | Experiments were conducted over two months on a pod of 8 A40 GPUs, each with 50 GB of GPU RAM.
Software Dependencies | No | The paper does not specify version numbers for software dependencies.
Experiment Setup | Yes | Each model in the Pythia suite has 154 checkpoints: 11 of these correspond to the model after 0, 1, 2, 4, ..., and 512 steps of training; the remaining 143 correspond to 1,000, 2,000, ..., and 143,000 steps. We find circuits at each of these checkpoints. As Pythia uses a uniform batch size of 2.1 million tokens, these models are trained on far more tokens (300 billion) than those in existing studies of model internals over time. We study models of varying sizes, from 70 million to 12 billion parameters. We search for the minimal circuit that achieves at least 80% of the whole model's performance on the task. We do this using binary search over circuit sizes; the initial search space ranges from 1 edge to 5% of the model's edges. To determine algorithmic consistency for the IOI circuit, we apply path patching as described in Appendix B in addition to using the component scores described in Appendix D. These are used to set thresholds for classifying attention heads. Though component-score thresholds can be arbitrary, applying them consistently across all model checkpoints allows us to see the degree of similarity in model behavior. Concretely, we use the following metrics and thresholds: Direct-effect heads: We initially perform path patching on all model attention heads, measuring their impact on the logit difference after the final layer of the model. We then classify attention heads as name-mover heads (NMHs), negative name-mover heads, and copy suppression heads (CSHs) based on a copy score (for NMHs) or CPSA (for CSHs) of > 10%, which yielded a small set of heads responsible for most of the direct effect. We measure the ratio of the absolute direct effect on logit difference for these heads vs. the total direct effect of all heads (including several unclassified heads) to obtain our first value.
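
The Open Datasets row mentions IOI examples built with Wang et al.'s [71] generator. The sketch below is a hypothetical, minimal stand-in for such a generator, intended only to illustrate the clean/corrupted prompt format used in IOI-style experiments; the template, name list, and `make_ioi_example` helper are assumptions, not the authors' code.

```python
import random

# Hypothetical, minimal stand-in for an IOI prompt generator in the style of
# Wang et al. [71]; the real generator uses many more templates and names.
NAMES = ["John", "Mary", "Tom", "Anna", "James", "Sarah"]
TEMPLATE = "When {A} and {B} went to the store, {C} gave a drink to"

def make_ioi_example(rng: random.Random) -> dict:
    """Return a clean prompt, a corrupted prompt, and the expected answer.

    Clean: the subject repeats one of the two names, so the indirect object
    is the correct completion. Corrupted: a third name replaces the subject,
    breaking the duplication cue the circuit relies on.
    """
    a, b, c = rng.sample(NAMES, 3)
    clean = TEMPLATE.format(A=a, B=b, C=b)      # B is duplicated -> answer is A
    corrupted = TEMPLATE.format(A=a, B=b, C=c)  # no duplicated name
    return {"clean": clean, "corrupted": corrupted, "answer": f" {a}"}

rng = random.Random(0)
dataset = [make_ioi_example(rng) for _ in range(70)]  # the paper uses 70 examples
```

The leading space on the answer token mirrors the usual tokenization convention for single-word completions; the corrupted prompts serve as the patching baseline.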
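
The Experiment Setup row describes a binary search over circuit sizes for the smallest circuit that recovers at least 80% of whole-model performance, with a search space from 1 edge to 5% of the model's edges. Here is a minimal sketch of that search, assuming a hypothetical `circuit_performance(n_edges)` helper that builds and evaluates a circuit from the top-n edges (e.g. ranked by an attribution score); it also assumes performance grows roughly monotonically with circuit size, which is what makes binary search applicable.

```python
def minimal_circuit_size(circuit_performance, full_model_performance,
                         total_edges, threshold=0.80):
    """Binary-search the smallest edge count whose circuit recovers at least
    `threshold` of the full model's task performance.

    `circuit_performance(n)` is an assumed helper that builds a circuit from
    the top-n edges and evaluates it on the task; the search space runs from
    1 edge to 5% of the model's edges, as in the paper's setup.
    """
    lo, hi = 1, max(1, int(0.05 * total_edges))
    target = threshold * full_model_performance
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if circuit_performance(mid) >= target:
            best = mid        # mid edges suffice; try a smaller circuit
            hi = mid - 1
        else:
            lo = mid + 1      # not enough edges; grow the circuit
    return best               # None if even the 5% cap falls short
```

Since the paper finds circuits at every one of the 154 checkpoints, a search of this form would be repeated per checkpoint and per model size.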
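
Finally, the direct-effect metric described above (the share of the total absolute direct effect on the logit difference carried by the classified NMHs, negative NMHs, and CSHs) reduces to a simple ratio. The sketch below assumes per-head direct effects and per-head copy/CPSA scores have already been computed via the path-patching and component-score analyses; the function and variable names are illustrative only.

```python
def classify_and_score(direct_effect, copy_score, cpsa_score, threshold=0.10):
    """Classify direct-effect heads by score threshold and return the share of
    total absolute direct effect (on the logit difference) that they carry.

    `direct_effect`, `copy_score`, and `cpsa_score` map head identifiers
    (e.g. (layer, head) tuples) to floats; all are assumed to come from the
    analyses described in the paper's appendices.
    """
    classified = {
        h for h in direct_effect
        if copy_score.get(h, 0.0) > threshold or cpsa_score.get(h, 0.0) > threshold
    }
    total = sum(abs(e) for e in direct_effect.values())
    covered = sum(abs(direct_effect[h]) for h in classified)
    return classified, covered / total if total else 0.0
```

Applying the same 10% threshold at every checkpoint, as the paper does, keeps the classification consistent so that changes in the ratio over training reflect changes in model behavior rather than changes in the criterion.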