Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
Authors: Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah Goodman
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward faithfully understanding the inner-workings of our ever-growing and most widely deployed language models. Our tool is extensible to larger LLMs and is released publicly at https://github.com/frankaging/align-transformers. |
| Researcher Affiliation | Academia | Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah D. Goodman, Stanford University, {wuzhengx, atticusg, icard, cgpotts, ngoodman}@stanford.edu |
| Pseudocode | Yes | A.5 Pseudocode for Boundless DAS (a minimal intervention sketch follows the table) |
| Open Source Code | Yes | Our tool is extensible to larger LLMs and is released publicly at https://github.com/frankaging/align-transformers. |
| Open Datasets | No | The core instruction contains an English sentence: Please say yes only if it costs between [X.XX] and [X.XX] dollars, otherwise no. followed by an input dollar amount [X.XX], where [X.XX] are random continuous real numbers drawn with a uniform distribution from [0.00, 9.99]. Our training set has 20K examples. Our in-training evaluation is limited to 200 examples for quicker training time, and the hold-out testing dataset has 1K examples. These datasets are generated on the fly for different causal models (i.e., different random seeds result in different training data). (A data-generation sketch follows the table.) |
| Dataset Splits | Yes | Our training set has 20K examples. Our in-training evaluation is limited to 200 examples for quicker training time, and the hold-out testing dataset has 1K examples. |
| Hardware Specification | Yes | Each alignment experiment is enabled with bfloat16 and can fit within a single A100 GPU. Each alignment experiment takes about 1.5 hrs. To finish all of our experiments, it takes roughly 2 weeks of run-time with 2 A100 nodes with 8 GPUs each. |
| Software Dependencies | No | The paper mentions Adam [26] as the optimizer and the `torch` library's orthogonalized parameterization for the rotation matrix, but it does not provide version numbers for these or any other software dependencies. (A parametrization sketch follows the table.) |
| Experiment Setup | Yes | To train the rotation matrix, we use Adam [26] as our optimizer with an effective batch size of 64 and a learning rate of 10^-3. We use a different learning rate 10^-2 for our boundary indices for quicker convergence. We train our rotation matrix for 3 epochs... For boundary learning, we anneal temperature from 50.0 to 0.10 with a step number equal to the total training steps and temperature steps with gradient backpropagation. (A training-loop sketch follows the table.) |
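
The pseudocode row above points to Appendix A.5 of the paper. As a companion, here is a minimal sketch of the core Boundless DAS operation, an interchange intervention in a learned rotated basis: rotate the base and source hidden states, swap the soft-masked subspace, and rotate back. The function name and tensor shapes are our assumptions; only the rotate-swap-rotate-back structure comes from the paper.

```python
import torch
import torch.nn as nn

def interchange_intervention(h_base: torch.Tensor,
                             h_source: torch.Tensor,
                             rotation: nn.Linear,
                             mask: torch.Tensor) -> torch.Tensor:
    """Boundless-DAS-style interchange intervention (sketch).

    Rotate the base and source hidden states into the learned basis,
    replace the soft-masked coordinates of the base with those of the
    source, and rotate back with the transpose, which inverts an
    orthogonal map.
    """
    r_base = rotation(h_base)           # h @ W^T
    r_source = rotation(h_source)
    r_mixed = (1.0 - mask) * r_base + mask * r_source
    return r_mixed @ rotation.weight    # (h' W^T) W = h' when W^T W = I
```

The intervened representation is then patched back into the model's forward pass, and the rotation and mask are trained so that the patched model matches the counterfactual predictions of the hypothesized causal model.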
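The Open Datasets row describes data that are generated on the fly rather than released. Under that description, a generation sketch might look like the following; the template wording, value range, and split sizes (20K / 200 / 1K) are quoted from the paper, while the field names and the decision to sort the two sampled bounds are our assumptions.

```python
import random

def make_example(rng: random.Random) -> dict:
    """One bracketing example: does the input amount fall between the bounds?"""
    # Assumption: the two bounds are sorted so the interval is well formed;
    # the paper only states they are drawn uniformly from [0.00, 9.99].
    low, high = sorted(round(rng.uniform(0.0, 9.99), 2) for _ in range(2))
    amount = round(rng.uniform(0.0, 9.99), 2)
    instruction = (f"Please say yes only if it costs between "
                   f"{low:.2f} and {high:.2f} dollars, otherwise no.")
    label = "yes" if low <= amount <= high else "no"
    return {"instruction": instruction, "input": f"{amount:.2f}", "output": label}

def make_split(n: int, seed: int) -> list:
    # Different seeds yield different data, matching the on-the-fly generation.
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]

train = make_split(20_000, seed=0)  # 20K training examples
dev = make_split(200, seed=1)       # 200 in-training evaluation examples
test = make_split(1_000, seed=2)    # 1K hold-out test examples
```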
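The Software Dependencies row mentions `torch`'s orthogonalized parameterization without a version. In recent PyTorch releases this is available as `torch.nn.utils.parametrizations.orthogonal`; the following sketch shows how the rotation matrix could be constrained with it. The module shape (a square map over the hidden dimension) is our assumption.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

hidden_dim = 4096  # assumption: Alpaca-7B hidden size
# Constrain a square linear map to remain orthogonal throughout training.
rotation = orthogonal(nn.Linear(hidden_dim, hidden_dim, bias=False))

h = torch.randn(2, hidden_dim)
rotated = rotation(h)                   # h @ W^T, into the learned basis
recovered = rotated @ rotation.weight   # (h W^T) W = h, since W^T W = I
assert torch.allclose(h, recovered, atol=1e-3)
```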
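Finally, the Experiment Setup row reports the optimizer configuration and temperature schedule. Below is a training-loop sketch consistent with those numbers, with hypothetical `boundary_left` and `boundary_right` parameters standing in for the learned boundary indices; the sigmoid form of the soft mask and the linear annealing schedule are assumptions, while the learning rates, batch size, epoch count, and temperature endpoints are quoted from the paper.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

hidden_dim = 4096  # assumption: Alpaca-7B hidden size
rotation = orthogonal(nn.Linear(hidden_dim, hidden_dim, bias=False))

# Hypothetical scalar boundary indices; initial values here are arbitrary.
boundary_left = nn.Parameter(torch.tensor(0.0))
boundary_right = nn.Parameter(torch.tensor(128.0))

# Reported hyperparameters: Adam, lr 1e-3 for the rotation matrix and
# lr 1e-2 for the boundary indices, effective batch size 64, 3 epochs.
optimizer = torch.optim.Adam([
    {"params": rotation.parameters(), "lr": 1e-3},
    {"params": [boundary_left, boundary_right], "lr": 1e-2},
])

# Anneal the temperature from 50.0 down to 0.10 over all training steps.
total_steps = 3 * (20_000 // 64)  # 3 epochs over 20K examples at batch size 64
positions = torch.arange(hidden_dim).float()
for temperature in torch.linspace(50.0, 0.10, total_steps):
    # Soft boundary mask: ~1 inside [left, right], ~0 outside; it hardens
    # toward a step function as the temperature approaches zero.
    mask = (torch.sigmoid((positions - boundary_left) / temperature)
            * torch.sigmoid((boundary_right - positions) / temperature))
    # ... compute the interchange-intervention loss with `mask` and `rotation`,
    # then loss.backward(); optimizer.step(); optimizer.zero_grad()
```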