Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders

Authors: Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, Neel Nanda

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Gated SAEs on multiple models: a one layer GELU activation language model [28], Pythia-2.8B [3] and Gemma-7B [18], and on multiple sites within models: MLP layer outputs, attention layer outputs, and residual stream activations. Across these models and sites, we find Gated SAEs to be a Pareto improvement over baseline SAEs holding training compute fixed (Fig. 1): they yield sparser decompositions at any desired level of reconstruction fidelity. We also conduct further follow-up ablations and investigations on a subset of these models and sites to better understand the differences between Gated SAEs and baseline SAEs. (A sketch of the sparsity/fidelity metrics behind this comparison appears after the table.)
Researcher Affiliation | Industry | Senthooran Rajamanoharan (Google DeepMind), Arthur Conmy (Google DeepMind), Lewis Smith (Google DeepMind), Tom Lieberum (Google DeepMind), Vikrant Varma (Google DeepMind), János Kramár (Google DeepMind), Rohin Shah (Google DeepMind), Neel Nanda (Google DeepMind)
Pseudocode | Yes | See Appendix J for pseudo-code for the forward pass and loss function. (A hedged re-implementation sketch of this forward pass and loss appears after the table.)
Open Source Code | No | We are unable to provide open access to the activation datasets or code used to train the SAEs in our experiments.
Open Datasets | No | We are unable to provide open access to the activation datasets or code used to train the SAEs in our experiments.
Dataset Splits | No | The paper mentions evaluating models on 'held-out tokens', which implies a test set, but it does not provide percentages, sample counts, or an explicit train/validation/test split, so the data partitioning cannot be reproduced.
Hardware Specification | Yes | Individual SAEs were each trained on TPU-v3 slices with a 2x2 topology [20].
Software Dependencies | No | The paper mentions the "Adam optimizer" but does not specify version numbers for any software dependencies required to replicate the experiment.
Experiment Setup | Yes | We use the Adam optimizer with β2 = 0.999 and β1 = 0.0, following Templeton et al. [47], as we also find this to be a slight improvement to training. We use a learning rate warm-up. ... We use learning rate 0.0003 for all Gated SAE experiments, and the GELU-1L baseline experiment. ... For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by E‖x‖₂, motivated by better hyperparameter transfer, and so changed learning rate to 0.001 and 0.00075. ... We generate activations from sequences of length 128 for GELU-1L, 2048 for Pythia-2.8B and 1024 for Gemma-7B. We use a batch size of 4096 for all runs. We use 300,000 training steps for GELU-1L and Gemma-7B runs, and 400,000 steps for Pythia-2.8B runs. (An illustrative training-loop sketch using these settings appears after the table.)
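
For readers without access to the paper's Appendix J, the following is a minimal PyTorch-style sketch of the Gated SAE forward pass and training loss as described in the paper. It is an editorial reconstruction, not the authors' code: the class and variable names (GatedSAE, d_sae, sparsity_coeff) are ours, and details such as initialisation and decoder-column normalisation are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedSAE(nn.Module):
    """Sketch of a Gated SAE: a binary gate decides which features fire,
    a weight-tied magnitude encoder decides how strongly they fire."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.r_mag = nn.Parameter(torch.zeros(d_sae))    # row-wise rescale: W_mag = exp(r_mag) * W_gate
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        x_centred = x - self.b_dec
        pi_gate = x_centred @ self.W_gate.T + self.b_gate      # gating pre-activations
        f_gate = (pi_gate > 0).float()                          # binary gate: which features fire
        W_mag = torch.exp(self.r_mag)[:, None] * self.W_gate    # weight-tied magnitude encoder
        f_mag = F.relu(x_centred @ W_mag.T + self.b_mag)        # feature magnitudes
        f = f_gate * f_mag                                      # sparse feature activations
        x_hat = f @ self.W_dec.T + self.b_dec                   # reconstruction
        return x_hat, f, pi_gate

    def loss(self, x: torch.Tensor, sparsity_coeff: float):
        x_hat, f, pi_gate = self.forward(x)
        l_reconstruct = (x - x_hat).pow(2).sum(-1).mean()
        l_sparsity = sparsity_coeff * F.relu(pi_gate).sum(-1).mean()  # L1 on ReLU(pi_gate)
        # Auxiliary term: reconstruct from ReLU(pi_gate) through a frozen copy of the
        # decoder, so this term only trains the gating path.
        x_hat_frozen = F.relu(pi_gate) @ self.W_dec.detach().T + self.b_dec.detach()
        l_aux = (x - x_hat_frozen).pow(2).sum(-1).mean()
        return l_reconstruct + l_sparsity + l_aux
```

The binary gate is non-differentiable, so the gating parameters receive their training signal through the auxiliary term, which routes ReLU(pi_gate) through a frozen copy of the decoder.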
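Assuming the GatedSAE sketch above, the sparsity/fidelity comparison referenced in the Research Type row reduces to measuring, on held-out activations, how many features fire per token (L0) and how well the input is reconstructed. The helper below is illustrative only; the paper's fidelity metric of "loss recovered", obtained by splicing reconstructions back into the language model, is not reproduced here.

```python
import torch


@torch.no_grad()
def l0_and_mse(sae, acts: torch.Tensor):
    """Mean number of active features per token (sparsity) and mean squared
    reconstruction error (fidelity proxy) on a batch of held-out activations."""
    x_hat, f, _ = sae(acts)
    l0 = (f > 0).float().sum(-1).mean().item()
    mse = (acts - x_hat).pow(2).sum(-1).mean().item()
    return l0, mse
```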
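The optimizer and schedule settings quoted in the Experiment Setup row can be assembled into a training loop as follows. This sketch reuses the GatedSAE class above; the warm-up length, SAE widths, sparsity coefficient, and the sample_activation_batch helper are placeholders, since those details are not fully specified in the excerpt.

```python
import torch

sae = GatedSAE(d_model=2048, d_sae=16384)        # hypothetical widths
optimizer = torch.optim.Adam(sae.parameters(), lr=3e-4, betas=(0.0, 0.999))  # beta1=0.0, beta2=0.999

warmup_steps = 1_000                             # placeholder: warm-up length not given in the excerpt
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)

sparsity_coeff = 1e-3                            # placeholder: the paper sweeps this coefficient per run
num_steps = 300_000                              # 400,000 for the Pythia-2.8B runs

for step in range(num_steps):
    # Hypothetical helper standing in for the activation-generation pipeline
    # (sequences of length 128/1024/2048 depending on the model).
    x = sample_activation_batch(batch_size=4096)
    loss = sae.loss(x, sparsity_coeff)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```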