Sparse Autoencoders Find Highly Interpretable Features in Language Models
Authors: Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, Lee Sharkey
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. First, we show that our features are on average more interpretable than neurons and other matrix decomposition techniques, as measured by autointerpretability scores (Section 3) (Bills et al., 2023). Next, we show that we are able to pinpoint the features used for a set task more precisely than other methods (Section 4). |
| Researcher Affiliation | Collaboration | Hoagy Cunningham (EleutherAI, MATS), Aidan Ewart (EleutherAI, University of Bristol), Logan Riggs (EleutherAI), Robert Huben, Lee Sharkey (Apollo Research) |
| Pseudocode | No | The paper describes a procedure in Section 4.1 but does not present it as a formally structured pseudocode or algorithm block. |
| Open Source Code | Yes | Code to replicate experiments can be found at https://github.com/HoagyC/sparse_coding |
| Open Datasets | Yes | To train the sparse autoencoder described in Section 2, we use data from the Pile (Gao et al., 2020), a large, public webtext corpus. |
| Dataset Splits | No | The paper mentions training on 'The Pile' and evaluating on a 'test set of 50 IOI data points' for a specific task, but it does not provide explicit training, validation, and test dataset splits with percentages or sample counts for the primary autoencoder training. |
| Hardware Specification | Yes | A single training run using this quantity of data completes in under an hour on a single A40 GPU. |
| Software Dependencies | No | The paper mentions models like GPT-4 and GPT-3.5 and the Adam optimiser, but does not provide specific version numbers for software libraries or dependencies (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1') needed for replication. |
| Experiment Setup | Yes | Our autoencoder is trained to minimise the loss function $\mathcal{L}(\mathbf{x}) = \frac{\lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2^2}{\dim(\mathbf{x})} + \alpha \lVert \mathbf{c} \rVert_1$, where α is a hyperparameter controlling the sparsity of the reconstruction... The autoencoders are trained with the Adam optimiser with a learning rate of 1e-3 and are trained on 5-50M activation vectors for 1-3 epochs... (a code sketch of this setup follows the table) |
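
The training setup quoted in the "Experiment Setup" row can be made concrete with a short sketch. The following PyTorch snippet is a minimal illustration, not the authors' implementation: it assumes a standard ReLU encoder and linear decoder, and the names (`SparseAutoencoder`, `d_model`, `d_hidden`, `alpha`) and sizes are placeholders chosen for readability. Only the loss form (reconstruction error normalised by activation dimension plus an L1 penalty on the codes) and the optimiser settings (Adam, learning rate 1e-3) are taken from the quote above.

```python
# Minimal sketch of the quoted sparse-autoencoder training setup.
# Architecture details and all names/sizes are assumptions for illustration.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        c = torch.relu(self.encoder(x))   # sparse feature activations c
        x_hat = self.decoder(c)           # reconstruction of the activation vector
        return x_hat, c

def sae_loss(x, x_hat, c, alpha: float):
    # L(x) = ||x - x_hat||_2^2 / dim(x) + alpha * ||c||_1
    recon = ((x - x_hat) ** 2).sum(dim=-1) / x.shape[-1]
    sparsity = alpha * c.abs().sum(dim=-1)
    return (recon + sparsity).mean()

# Training loop matching the quoted settings: Adam optimiser, lr = 1e-3.
model = SparseAutoencoder(d_model=512, d_hidden=4096)   # placeholder sizes
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

data = torch.randn(1024, 512)  # stand-in for stored language-model activations
for batch in data.split(256):
    x_hat, c = model(batch)
    loss = sae_loss(batch, x_hat, c, alpha=1e-3)  # alpha controls sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the paper's actual runs, the loss is applied to 5-50M activation vectors drawn from the Pile for 1-3 epochs; the random tensor above merely stands in for that activation dataset.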