Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Sparse Autoencoders Do Not Find Canonical Units of Analysis

Authors: Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. ... Using meta-SAEs (SAEs trained on the decoder matrix of another SAE) we find that latents in SAEs often decompose into combinations of latents from a smaller SAE. ... We evaluate the performance of BatchTopK on the activations of two LLMs: GPT-2 Small (residual stream layer 8) and Gemma 2 2B (residual stream layer 12). We use a range of dictionary sizes and values for k, and compare our results to TopK and JumpReLU SAEs in terms of normalized mean squared error (NMSE) and cross-entropy degradation. ... Our empirical results suggest that simply training larger SAEs is unlikely to result in a canonical set of units for all mechanistic interpretability tasks, and that the choice of dictionary size is subjective.
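The excerpt above compares BatchTopK against TopK SAEs using NMSE. A minimal sketch of the two ideas it references, assuming a standard formulation (BatchTopK keeps the batch_size × k largest pre-activations across the whole batch rather than k per sample; function names here are illustrative, not from the paper):

```python
import numpy as np

def batch_topk(latents: np.ndarray, k: int) -> np.ndarray:
    """BatchTopK sketch: keep the batch_size * k largest pre-activations
    across the entire batch, zeroing the rest. Per-sample TopK would
    instead keep the k largest in each row independently."""
    batch_size, _ = latents.shape
    n_keep = batch_size * k
    flat = latents.ravel()
    if n_keep >= flat.size:
        return latents.copy()
    # Threshold at the n_keep-th largest value across the whole batch.
    thresh = np.partition(flat, -n_keep)[-n_keep]
    return np.where(latents >= thresh, latents, 0.0)

def nmse(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Normalized mean squared error, one common definition:
    MSE of the reconstruction divided by the mean squared input."""
    return float(np.mean((x - x_hat) ** 2) / np.mean(x ** 2))
```

With k = 1 on a batch of two rows, the two kept activations can land in the same or different rows, which is exactly what distinguishes BatchTopK from per-sample TopK.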
Researcher Affiliation | Collaboration | Patrick Leask (Department of Computer Science, Durham University, EMAIL); Bart Bussmann (Independent, EMAIL); Michael Pearce (Independent); Joseph Bloom (Decode Research); Curt Tigges (Decode Research); Noura Al Moubayed (Department of Computer Science, Durham University); Lee Sharkey (Apollo Research)
Pseudocode | No | The paper describes its methods using mathematical equations and textual explanations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | No | We provide an interactive dashboard to explore meta-SAEs: https://metasaes.streamlit.app/ The GPT-2 SAEs are available on Neuronpedia. We also use two of the Gemma Scope SAEs (Lieberum et al., 2024) trained on Gemma 2 2B (Team et al., 2024). We used the TransformerLens (https://transformerlensorg.github.io/TransformerLens/) implementations of GPT-2 and Gemma 2 2B.
Open Datasets | Yes | We trained our sparse autoencoders (SAEs) on the OpenWebText dataset (https://huggingface.co/datasets/openwebtext), which was processed into sequences of a maximum of 128 tokens for input into the language models.
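The preprocessing described above (splitting the corpus into sequences of at most 128 tokens) can be sketched as a simple chunker; the exact pipeline is not given in the excerpt, so this is only an illustrative assumption:

```python
def chunk_tokens(token_ids: list, max_len: int = 128) -> list:
    """Split a flat stream of token ids into sequences of at most
    max_len tokens, as described for the OpenWebText preprocessing.
    (Sketch only; the paper excerpt does not specify the real pipeline.)"""
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids), max_len)]
```

A 300-token document would yield chunks of 128, 128, and 44 tokens.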
Dataset Splits | No | The paper mentions using the OpenWebText dataset and processing it into sequences of a maximum of 128 tokens, but it does not specify any training, validation, or test splits for the dataset.
Hardware Specification | No | The paper does not explicitly mention the specific hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | All models were trained using the Adam optimizer with a learning rate of 3 × 10⁻⁴, β1 = 0.9, and β2 = 0.99. We used the TransformerLens (https://transformerlensorg.github.io/TransformerLens/) implementations of GPT-2 and Gemma 2 2B.
Experiment Setup | Yes | All models were trained using the Adam optimizer with a learning rate of 3 × 10⁻⁴, β1 = 0.9, and β2 = 0.99. The batch size was 4096, and training continued until a total of 1 × 10⁹ tokens were processed. We experimented with dictionary sizes of 3072, 6144, 12288, and 24576 for the GPT-2 Small model, and used a dictionary size of 16384 for the experiment on Gemma 2 2B. In both experiments, we varied the number of active latents k among 16, 32, and 64. For the JumpReLU SAEs, we varied the sparsity coefficient such that the resulting sparsity would match the active latents k of the BatchTopK and TopK models. The sparsity penalties in the experiments on GPT-2 Small were 0.004, 0.0018, and 0.0008. For the Gemma 2 2B model we used sparsity penalties of 0.02, 0.005, and 0.001. In both experiments, we set the bandwidth parameter to 0.001.
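The hyperparameters quoted in this row can be collected into a small config sketch, which also makes two implied quantities explicit: the number of optimizer steps (total tokens / batch size) and the GPT-2 sweep size (dictionary sizes × k values). Variable names are illustrative, not from the paper:

```python
from itertools import product

# Hyperparameters quoted in the excerpt (names are our own).
ADAM = {"lr": 3e-4, "beta1": 0.9, "beta2": 0.99}
BATCH_SIZE = 4096
TOTAL_TOKENS = 1_000_000_000          # 1 × 10⁹ tokens

GPT2_DICT_SIZES = [3072, 6144, 12288, 24576]
GEMMA_DICT_SIZES = [16384]
KS = [16, 32, 64]                     # active latents k

# Implied training length: ~244k optimizer steps.
steps = TOTAL_TOKENS // BATCH_SIZE

# Implied GPT-2 Small sweep: 4 dictionary sizes × 3 values of k = 12 runs.
gpt2_runs = list(product(GPT2_DICT_SIZES, KS))
```

This is a bookkeeping sketch only; it does not reproduce the training loop or the JumpReLU sparsity-coefficient matching described in the row.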