Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Sparse Autoencoders Do Not Find Canonical Units of Analysis
Authors: Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. ... Using meta-SAEs (SAEs trained on the decoder matrix of another SAE), we find that latents in SAEs often decompose into combinations of latents from a smaller SAE... We evaluate the performance of BatchTopK on the activations of two LLMs: GPT-2 Small (residual stream layer 8) and Gemma 2 2B (residual stream layer 12). We use a range of dictionary sizes and values for k, and compare our results to TopK and JumpReLU SAEs in terms of normalized mean squared error (NMSE) and cross-entropy degradation. ... Our empirical results suggest that simply training larger SAEs is unlikely to result in a canonical set of units for all mechanistic interpretability tasks, and that the choice of dictionary size is subjective. |
| Researcher Affiliation | Collaboration | Patrick Leask (Department of Computer Science, Durham University); Bart Bussmann (Independent); Michael Pearce (Independent); Joseph Bloom (Decode Research); Curt Tigges (Decode Research); Noura Al Moubayed (Department of Computer Science, Durham University); Lee Sharkey (Apollo Research) |
| Pseudocode | No | The paper describes its methods using mathematical equations and textual explanations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | No | We provide an interactive dashboard to explore meta-SAEs: https://metasaes.streamlit.app/ The GPT-2 SAEs are available on Neuronpedia. We also use two of the Gemma Scope SAEs (Lieberum et al., 2024) trained on Gemma 2 2B (Team et al., 2024). We used the TransformerLens (https://transformerlensorg.github.io/TransformerLens/) implementations of GPT-2 and Gemma 2 2B. |
| Open Datasets | Yes | We trained our sparse autoencoders (SAEs) on the OpenWebText dataset (https://huggingface.co/datasets/openwebtext), which was processed into sequences of a maximum of 128 tokens for input into the language models. |
| Dataset Splits | No | The paper mentions using the OpenWebText dataset and processing it into sequences of a maximum of 128 tokens, but it does not specify any training, validation, or test splits for the dataset. |
| Hardware Specification | No | The paper does not explicitly mention the specific hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | All models were trained using the Adam optimizer with a learning rate of 3 × 10⁻⁴, β1 = 0.9, and β2 = 0.99. We used the TransformerLens (https://transformerlensorg.github.io/TransformerLens/) implementations of GPT-2 and Gemma 2 2B. |
| Experiment Setup | Yes | All models were trained using the Adam optimizer with a learning rate of 3 × 10⁻⁴, β1 = 0.9, and β2 = 0.99. The batch size was 4096, and training continued until a total of 1 × 10⁹ tokens were processed. We experimented with dictionary sizes of 3072, 6144, 12288, and 24576 for the GPT-2 Small model, and used a dictionary size of 16384 for the experiment on Gemma 2 2B. In both experiments, we varied the number of active latents k among 16, 32, and 64. For the JumpReLU SAEs, we varied the sparsity coefficient such that the resulting sparsity would match the active latents k of the BatchTopK and TopK models. The sparsity penalties in the experiments on GPT-2 Small were 0.004, 0.0018, and 0.0008. For the Gemma 2 2B model we used sparsity penalties of 0.02, 0.005, and 0.001. In both experiments, we set the bandwidth parameter to 0.001. |
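The quoted setup compares BatchTopK, TopK, and JumpReLU SAEs on normalized mean squared error. The two core pieces can be sketched as follows; this is a minimal NumPy sketch under the standard definitions (BatchTopK keeps the batch_size × k largest activations across the whole batch rather than k per sample), not the authors' implementation, and the function names are illustrative:

```python
import numpy as np

def batch_topk_mask(acts: np.ndarray, k: int) -> np.ndarray:
    """Zero all but the batch_size * k largest activations across the
    whole batch (BatchTopK), unlike per-sample TopK which keeps k each."""
    batch_size, n_latents = acts.shape
    n_keep = batch_size * k
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Threshold at the n_keep-th largest activation in the batch.
    thresh = np.partition(flat, -n_keep)[-n_keep]
    return np.where(acts >= thresh, acts, 0.0)

def nmse(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Normalized mean squared error: reconstruction error relative
    to the norm of the original activations."""
    return float(np.sum((x - x_hat) ** 2) / np.sum(x ** 2))
```

Because the threshold is shared across the batch, individual samples can use more or fewer than k latents while the average sparsity stays at k, which is the flexibility the paper's comparison against per-sample TopK exploits.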