Sparse Autoencoders Find Highly Interpretable Features in Language Models

Authors: Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, Lee Sharkey

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. First, we show that our features are on average more interpretable than neurons and other matrix decomposition techniques, as measured by autointerpretability scores (Section 3) (Bills et al., 2023). Next, we show that we are able to pinpoint the features used for a set task more precisely than other methods (Section 4).
Researcher Affiliation | Collaboration | Hoagy Cunningham (EleutherAI, MATS), Aidan Ewart (EleutherAI, University of Bristol), Logan Riggs (EleutherAI), Robert Huben, Lee Sharkey (Apollo Research)
Pseudocode | No | The paper describes a procedure in Section 4.1 but does not present it as formally structured pseudocode or an algorithm block.
Open Source Code | Yes | Code to replicate experiments can be found at https://github.com/HoagyC/sparse_coding
Open Datasets | Yes | To train the sparse autoencoder described in Section 2, we use data from the Pile (Gao et al., 2020), a large, public webtext corpus. (A hedged sketch of collecting activation vectors from such a corpus follows this table.)
Dataset Splits | No | The paper mentions training on the Pile and evaluating on a 'test set of 50 IOI data points' for a specific task, but it does not provide explicit training, validation, and test dataset splits with percentages or sample counts for the primary autoencoder training.
Hardware Specification | Yes | A single training run using this quantity of data completes in under an hour on a single A40 GPU.
Software Dependencies | No | The paper mentions the GPT-4 and GPT-3.5 models and the Adam optimiser, but does not provide specific version numbers for software libraries or dependencies (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1') needed for replication.
Experiment Setup | Yes | Our autoencoder is trained to minimise the loss function $\mathcal{L}(\mathbf{x}) = \frac{\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2}{\dim(\mathbf{x})} + \alpha\|\mathbf{c}\|_1$, where $\alpha$ is a hyperparameter controlling the sparsity of the reconstruction... The autoencoders are trained with the Adam optimiser with a learning rate of 1e-3 and are trained on 5-50M activation vectors for 1-3 epochs... (A hedged PyTorch sketch of this training setup follows this table.)
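
For context on the data pipeline referenced in the Open Datasets row, the sketch below shows one way to collect language-model activation vectors from a text corpus such as the Pile. The model name, layer index, sequence length, and example text are illustrative assumptions, not details confirmed by the paper.

```python
# Hedged sketch: gathering hidden-state activation vectors for SAE training.
# Model choice, layer index, and the placeholder text are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m"  # assumed small model for illustration
LAYER = 2                        # assumed layer index

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

@torch.no_grad()
def collect_activations(texts, layer=LAYER):
    """Return hidden-state activations at `layer` for a list of strings."""
    chunks = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
        out = model(**inputs, output_hidden_states=True)
        # out.hidden_states[layer] has shape (1, seq_len, d_model); drop the batch dim
        chunks.append(out.hidden_states[layer].squeeze(0))
    return torch.cat(chunks, dim=0)  # (total_tokens, d_model)

# Placeholder usage; in practice the texts would be streamed from a Pile-derived corpus.
acts = collect_activations(["The quick brown fox jumps over the lazy dog."])
```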
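The Experiment Setup row quotes the loss and optimiser. The sketch below is a minimal PyTorch rendering of that setup, assuming an untied encoder/decoder, illustrative activation and dictionary sizes, and a placeholder iterable of activation batches; details such as decoder-weight normalisation are omitted. It is a sketch of the quoted objective, not the authors' exact implementation.

```python
# Hedged sketch of the quoted training setup: a ReLU sparse autoencoder trained
# with MSE (normalised by input dimension) plus an L1 sparsity penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_activation: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_activation, d_dict)
        self.decoder = nn.Linear(d_dict, d_activation, bias=False)

    def forward(self, x: torch.Tensor):
        c = F.relu(self.encoder(x))   # sparse feature coefficients
        x_hat = self.decoder(c)       # reconstruction from the learned dictionary
        return x_hat, c

def sae_loss(x, x_hat, c, alpha: float):
    # L(x) = ||x - x_hat||_2^2 / dim(x) + alpha * ||c||_1
    recon = (x - x_hat).pow(2).sum(dim=-1) / x.shape[-1]
    sparsity = alpha * c.abs().sum(dim=-1)
    return (recon + sparsity).mean()

# Illustrative sizes and sparsity coefficient; only the Adam learning rate of
# 1e-3 is taken from the paper.
d_activation, d_dict, alpha = 512, 4096, 1e-3
sae = SparseAutoencoder(d_activation, d_dict)
optim = torch.optim.Adam(sae.parameters(), lr=1e-3)

def train(activation_batches):
    """activation_batches: placeholder iterable of (batch, d_activation) tensors."""
    for x in activation_batches:
        x_hat, c = sae(x)
        loss = sae_loss(x, x_hat, c, alpha)
        optim.zero_grad()
        loss.backward()
        optim.step()
```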