Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

Authors: Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah Goodman

NeurIPS 2023

Reproducibility assessment. Each entry lists the variable, the result, and the supporting LLM response.
Research Type: Experimental
    "We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward faithfully understanding the inner-workings of our ever-growing and most widely deployed language models. Our tool is extensible to larger LLMs and is released publicly at https://github.com/frankaging/align-transformers."
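The "two interpretable boolean variables" in the abstract can be sketched as one boundary check per bracket endpoint. This is a minimal illustrative sketch; the function and variable names, and the exact decomposition, are assumptions, not the paper's formulation.

```python
# Hypothetical two-boolean-variable causal model for the bracketing task
# "say yes only if it costs between lower and upper" (names are illustrative).

def causal_model(amount: float, lower: float, upper: float) -> str:
    left_boundary = amount >= lower    # boolean variable 1: above lower bound
    right_boundary = amount <= upper   # boolean variable 2: below upper bound
    return "Yes" if (left_boundary and right_boundary) else "No"
```

For example, `causal_model(3.50, 1.00, 5.00)` returns `"Yes"` while `causal_model(0.50, 1.00, 5.00)` returns `"No"`.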
Researcher Affiliation: Academia
    "Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah D. Goodman, Stanford University. {wuzhengx, atticusg, icard, cgpotts, ngoodman}@stanford.edu"
Pseudocode: Yes
    "A.5 Pseudocode for Boundless DAS"
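The paper's Appendix A.5 pseudocode is not reproduced here. As a rough orientation only: a Boundless-DAS-style interchange intervention rotates a hidden representation into a learned basis, swaps the dimensions inside learned boundaries with those from a counterfactual source run, and rotates back. The sketch below is a simplified NumPy approximation in which the method's soft boundary masks and learned orthogonal parameterization are reduced to a fixed hard mask and a given orthogonal matrix.

```python
import numpy as np

def interchange_intervention(h_base, h_source, rotation, mask):
    """Simplified interchange intervention (hard-mask sketch, not the paper's exact method).

    h_base, h_source: hidden vectors from the base and counterfactual runs.
    rotation: an orthogonal matrix (learned in the real method).
    mask: 0/1 vector; 1 marks rotated dims inside the learned boundary.
    """
    r_base = rotation @ h_base                      # rotate into the aligned basis
    r_source = rotation @ h_source
    mixed = (1 - mask) * r_base + mask * r_source   # swap the boundary dims
    return rotation.T @ mixed                       # rotate back (R is orthogonal)
```

With the identity rotation and mask `[1, 0]`, the first coordinate of the base vector is replaced by the source's first coordinate and the second is left untouched.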
Open Source Code: Yes
    "Our tool is extensible to larger LLMs and is released publicly at https://github.com/frankaging/align-transformers."
Open Datasets: No
    "The core instruction contains an English sentence: 'Please say yes only if it costs between [X.XX] and [X.XX] dollars, otherwise no.' followed by an input dollar amount [X.XX], where [X.XX] are random continuous real numbers drawn with a uniform distribution from [0.00, 9.99]. Our training set has 20K examples. Our in-training evaluation is limited to 200 examples for quicker training time, and the hold-out testing dataset has 1K examples. These datasets are generated on the fly for different causal models (i.e., different random seeds result in different training data)."
Dataset Splits: Yes
    "Our training set has 20K examples. Our in-training evaluation is limited to 200 examples for quicker training time, and the hold-out testing dataset has 1K examples."
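The dataset described above could be regenerated along the following lines. This is a sketch under stated assumptions: the function names, the example schema, two-decimal rounding, ordering the two bracket values, and materializing the splits as lists (rather than generating on the fly) are all illustrative choices, not the paper's exact code.

```python
import random

TEMPLATE = ("Please say yes only if it costs between "
            "{lo:.2f} and {hi:.2f} dollars, otherwise no.")

def make_example(rng: random.Random) -> dict:
    # Draw the two bracket values and the input amount uniformly from
    # [0.00, 9.99], as described in the paper; sorting so lo <= hi is an assumption.
    lo, hi = sorted(round(rng.uniform(0.0, 9.99), 2) for _ in range(2))
    amount = round(rng.uniform(0.0, 9.99), 2)
    return {
        "instruction": TEMPLATE.format(lo=lo, hi=hi),
        "input": f"{amount:.2f}",
        "output": "Yes" if lo <= amount <= hi else "No",
    }

def make_split(n: int, seed: int) -> list:
    # Different seeds yield different data, matching "generated on the fly
    # for different causal models" in the quoted description.
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]

train = make_split(20_000, seed=0)   # 20K training examples
dev = make_split(200, seed=1)        # 200 in-training evaluation examples
test = make_split(1_000, seed=2)     # 1K hold-out test examples
```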
Hardware Specification: Yes
    "Each alignment experiment is enabled with bfloat16 and can fit within a single A100 GPU. Each alignment experiment takes about 1.5 hrs. To finish all of our experiments, it takes roughly 2 weeks of run-time with 2 A100 nodes with 8 GPUs each."
Software Dependencies: No
    The paper mentions using Adam [26] as an optimizer and the `torch` library for orthogonalized parameterization, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup: Yes
    "To train the rotation matrix, we use Adam [26] as our optimizer with an effective batch size of 64 and a learning rate of 10^-3. We use a different learning rate 10^-2 for our boundary indices for quicker convergence. We train our rotation matrix for 3 epochs... For boundary learning, we anneal temperature from 50.0 to 0.10 with a step number equal to the total training steps and temperature steps with gradient backpropagation."
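The optimizer settings quoted above (two learning rates: 10^-3 for the rotation, 10^-2 for the boundary indices; temperature annealed from 50.0 to 0.10 over all training steps) can be wired up roughly as below. The module shapes, parameter names, and the linear annealing schedule are placeholder assumptions; only the learning rates, the use of Adam, and the orthogonal parameterization via `torch` come from the source.

```python
import torch

# Placeholder parameters standing in for the learned rotation and the
# boundary indices of Boundless DAS (dimension 128 is an arbitrary choice).
rotation = torch.nn.Linear(128, 128, bias=False)
torch.nn.utils.parametrizations.orthogonal(rotation)  # keep the matrix orthogonal
boundaries = torch.nn.Parameter(torch.zeros(2))

# Two parameter groups with the learning rates quoted in the paper.
optimizer = torch.optim.Adam([
    {"params": rotation.parameters(), "lr": 1e-3},   # rotation matrix
    {"params": [boundaries], "lr": 1e-2},            # boundary indices
])

def temperature(step: int, total_steps: int, start: float = 50.0, end: float = 0.1) -> float:
    # Anneal from `start` to `end` over all training steps; a linear
    # schedule is assumed here, since the paper does not specify the shape.
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return start + frac * (end - start)
```

At each training step one would compute `temperature(step, total_steps)` and use it to sharpen the soft boundary masks as training progresses.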