Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
Authors: Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah Goodman
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward faithfully understanding the inner-workings of our ever-growing and most widely deployed language models. Our tool is extensible to larger LLMs and is released publicly at https://github.com/frankaging/align-transformers. |
| Researcher Affiliation | Academia | Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah D. Goodman, Stanford University, {wuzhengx, atticusg, icard, cgpotts, ngoodman}@stanford.edu |
| Pseudocode | Yes | A.5 Pseudocode for Boundless DAS (a minimal intervention sketch follows the table) |
| Open Source Code | Yes | Our tool is extensible to larger LLMs and is released publicly at https://github.com/frankaging/align-transformers. |
| Open Datasets | No | The core instruction contains an English sentence: Please say yes only if it costs between [X.XX] and [X.XX] dollars, otherwise no. followed by an input dollar amount [X.XX], where [X.XX] are random continuous real numbers drawn with a uniform distribution from [0.00, 9.99]. Our training set has 20K examples. Our in-training evaluation is limited to 200 examples for quicker training time, and the hold-out testing dataset has 1K examples. These datasets are generated on the fly for different causal models (i.e., different random seeds result in different training data). (A data-generation sketch follows the table.) |
| Dataset Splits | Yes | Our training set has 20K examples. Our in-training evaluation is limited to 200 examples for quicker training time, and the hold-out testing dataset has 1K examples. |
| Hardware Specification | Yes | Each alignment experiment is enabled with bfloat16 and can fit within a single A100 GPU. Each alignment experiment takes about 1.5 hrs. To finish all of our experiments, it takes roughly 2 weeks of run-time with 2 A100 nodes with 8 GPUs each. |
| Software Dependencies | No | The paper mentions Adam [26] as the optimizer and the `torch` library's orthogonalized parameterization for the rotation matrix, but it does not provide version numbers for these or any other software dependencies. (A parametrization sketch follows the table.) |
| Experiment Setup | Yes | To train the rotation matrix, we use Adam [26] as our optimizer with an effective batch size of 64 and a learning rate of 10^-3. We use a different learning rate 10^-2 for our boundary indices for quicker convergence. We train our rotation matrix for 3 epochs... For boundary learning, we anneal temperature from 50.0 to 0.10 with a step number equal to the total training steps and temperature steps with gradient backpropagation. (A training-loop sketch follows the table.) |
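
The pseudocode row above points to Appendix A.5 of the paper. As a companion, here is a minimal sketch of the core Boundless DAS operation, an interchange intervention in a learned rotated basis: rotate the base and source hidden states, swap the soft-masked subspace, and rotate back. The function name and tensor shapes are our assumptions; only the rotate-swap-rotate-back structure comes from the paper.

```python
import torch
import torch.nn as nn

def interchange_intervention(h_base: torch.Tensor,
                             h_source: torch.Tensor,
                             rotation: nn.Linear,
                             mask: torch.Tensor) -> torch.Tensor:
    """Boundless-DAS-style interchange intervention (sketch).

    Rotate the base and source hidden states into the learned basis,
    replace the soft-masked coordinates of the base with those of the
    source, and rotate back with the transpose, which inverts an
    orthogonal map.
    """
    r_base = rotation(h_base)           # h @ W^T
    r_source = rotation(h_source)
    r_mixed = (1.0 - mask) * r_base + mask * r_source
    return r_mixed @ rotation.weight    # (h' W^T) W = h' when W^T W = I
```

The intervened representation is then patched back into the model's forward pass, and the rotation and mask are trained so that the patched model matches the counterfactual predictions of the hypothesized causal model.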
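The Open Datasets row describes data that are generated on the fly rather than released. Under that description, a generation sketch might look like the following; the template wording, value range, and split sizes (20K / 200 / 1K) are quoted from the paper, while the field names and the decision to sort the two sampled bounds are our assumptions.

```python
import random

def make_example(rng: random.Random) -> dict:
    """One bracketing example: does the input amount fall between the bounds?"""
    # Assumption: the two bounds are sorted so the interval is well formed;
    # the paper only states they are drawn uniformly from [0.00, 9.99].
    low, high = sorted(round(rng.uniform(0.0, 9.99), 2) for _ in range(2))
    amount = round(rng.uniform(0.0, 9.99), 2)
    instruction = (f"Please say yes only if it costs between "
                   f"{low:.2f} and {high:.2f} dollars, otherwise no.")
    label = "yes" if low <= amount <= high else "no"
    return {"instruction": instruction, "input": f"{amount:.2f}", "output": label}

def make_split(n: int, seed: int) -> list:
    # Different seeds yield different data, matching the on-the-fly generation.
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]

train = make_split(20_000, seed=0)  # 20K training examples
dev = make_split(200, seed=1)       # 200 in-training evaluation examples
test = make_split(1_000, seed=2)    # 1K hold-out test examples
```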
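The Software Dependencies row mentions `torch`'s orthogonalized parameterization without a version. In recent PyTorch releases this is available as `torch.nn.utils.parametrizations.orthogonal`; the following sketch shows how the rotation matrix could be constrained with it. The module shape (a square map over the hidden dimension) is our assumption.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

hidden_dim = 4096  # assumption: Alpaca-7B hidden size
# Constrain a square linear map to remain orthogonal throughout training.
rotation = orthogonal(nn.Linear(hidden_dim, hidden_dim, bias=False))

h = torch.randn(2, hidden_dim)
rotated = rotation(h)                   # h @ W^T, into the learned basis
recovered = rotated @ rotation.weight   # (h W^T) W = h, since W^T W = I
assert torch.allclose(h, recovered, atol=1e-3)
```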
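Finally, the Experiment Setup row reports the optimizer configuration and temperature schedule. Below is a training-loop sketch consistent with those numbers, with hypothetical `boundary_left` and `boundary_right` parameters standing in for the learned boundary indices; the sigmoid form of the soft mask and the linear annealing schedule are assumptions, while the learning rates, batch size, epoch count, and temperature endpoints are quoted from the paper.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

hidden_dim = 4096  # assumption: Alpaca-7B hidden size
rotation = orthogonal(nn.Linear(hidden_dim, hidden_dim, bias=False))

# Hypothetical scalar boundary indices; initial values here are arbitrary.
boundary_left = nn.Parameter(torch.tensor(0.0))
boundary_right = nn.Parameter(torch.tensor(128.0))

# Reported hyperparameters: Adam, lr 1e-3 for the rotation matrix and
# lr 1e-2 for the boundary indices, effective batch size 64, 3 epochs.
optimizer = torch.optim.Adam([
    {"params": rotation.parameters(), "lr": 1e-3},
    {"params": [boundary_left, boundary_right], "lr": 1e-2},
])

# Anneal the temperature from 50.0 down to 0.10 over all training steps.
total_steps = 3 * (20_000 // 64)  # 3 epochs over 20K examples at batch size 64
positions = torch.arange(hidden_dim).float()
for temperature in torch.linspace(50.0, 0.10, total_steps):
    # Soft boundary mask: ~1 inside [left, right], ~0 outside; it hardens
    # toward a step function as the temperature approaches zero.
    mask = (torch.sigmoid((positions - boundary_left) / temperature)
            * torch.sigmoid((boundary_right - positions) / temperature))
    # ... compute the interchange-intervention loss with `mask` and `rotation`,
    # then loss.backward(); optimizer.step(); optimizer.zero_grad()
```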