Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders

Authors: Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, Neel Nanda

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Gated SAEs on multiple models: a one layer GELU activation language model [28], Pythia-2.8B [3] and Gemma-7B [18], and on multiple sites within models: MLP layer outputs, attention layer outputs, and residual stream activations. Across these models and sites, we find Gated SAEs to be a Pareto improvement over baseline SAEs holding training compute fixed (Fig. 1): they yield sparser decompositions at any desired level of reconstruction fidelity. We also conduct further follow-up ablations and investigations on a subset of these models and sites to better understand the differences between Gated SAEs and baseline SAEs. (A sketch of the sparsity/fidelity metrics behind this comparison appears after the table.)
Researcher Affiliation | Industry | Senthooran Rajamanoharan (Google DeepMind), Arthur Conmy (Google DeepMind), Lewis Smith (Google DeepMind), Tom Lieberum (Google DeepMind), Vikrant Varma (Google DeepMind), János Kramár (Google DeepMind), Rohin Shah (Google DeepMind), Neel Nanda (Google DeepMind)
Pseudocode | Yes | See Appendix J for pseudo-code for the forward pass and loss function. (A hedged re-implementation sketch of this forward pass and loss appears after the table.)
Open Source Code | No | We are unable to provide open access to the activation datasets or code used to train the SAEs in our experiments.
Open Datasets | No | We are unable to provide open access to the activation datasets or code used to train the SAEs in our experiments.
Dataset Splits | No | The paper mentions evaluating models on 'held-out tokens', which implies a test set, but it does not provide percentages, sample counts, or an explicit train/validation/test split, so the data partitioning cannot be reproduced.
Hardware Specification | Yes | Individual SAEs were each trained on TPU-v3 slices with a 2x2 topology [20].
Software Dependencies | No | The paper mentions the "Adam optimizer" but does not specify version numbers for any software dependencies required to replicate the experiment.
Experiment Setup | Yes | We use the Adam optimizer with β2 = 0.999 and β1 = 0.0, following Templeton et al. [47], as we also find this to be a slight improvement to training. We use a learning rate warm-up. ... We use learning rate 0.0003 for all Gated SAE experiments, and the GELU-1L baseline experiment. ... For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by E‖x‖₂, motivated by better hyperparameter transfer, and so changed learning rate to 0.001 and 0.00075. ... We generate activations from sequences of length 128 for GELU-1L, 2048 for Pythia-2.8B and 1024 for Gemma-7B. We use a batch size of 4096 for all runs. We use 300,000 training steps for GELU-1L and Gemma-7B runs, and 400,000 steps for Pythia-2.8B runs. (An illustrative training-loop sketch using these settings appears after the table.)
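
For readers without access to the paper's Appendix J, the following is a minimal PyTorch-style sketch of the Gated SAE forward pass and training loss as described in the paper. It is an editorial reconstruction, not the authors' code: the class and variable names (GatedSAE, d_sae, sparsity_coeff) are ours, and details such as initialisation and decoder-column normalisation are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedSAE(nn.Module):
    """Sketch of a Gated SAE: a binary gate decides which features fire,
    a weight-tied magnitude encoder decides how strongly they fire."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.r_mag = nn.Parameter(torch.zeros(d_sae))    # row-wise rescale: W_mag = exp(r_mag) * W_gate
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        x_centred = x - self.b_dec
        pi_gate = x_centred @ self.W_gate.T + self.b_gate      # gating pre-activations
        f_gate = (pi_gate > 0).float()                          # binary gate: which features fire
        W_mag = torch.exp(self.r_mag)[:, None] * self.W_gate    # weight-tied magnitude encoder
        f_mag = F.relu(x_centred @ W_mag.T + self.b_mag)        # feature magnitudes
        f = f_gate * f_mag                                      # sparse feature activations
        x_hat = f @ self.W_dec.T + self.b_dec                   # reconstruction
        return x_hat, f, pi_gate

    def loss(self, x: torch.Tensor, sparsity_coeff: float):
        x_hat, f, pi_gate = self.forward(x)
        l_reconstruct = (x - x_hat).pow(2).sum(-1).mean()
        l_sparsity = sparsity_coeff * F.relu(pi_gate).sum(-1).mean()  # L1 on ReLU(pi_gate)
        # Auxiliary term: reconstruct from ReLU(pi_gate) through a frozen copy of the
        # decoder, so this term only trains the gating path.
        x_hat_frozen = F.relu(pi_gate) @ self.W_dec.detach().T + self.b_dec.detach()
        l_aux = (x - x_hat_frozen).pow(2).sum(-1).mean()
        return l_reconstruct + l_sparsity + l_aux
```

The binary gate is non-differentiable, so the gating parameters receive their training signal through the auxiliary term, which routes ReLU(pi_gate) through a frozen copy of the decoder.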
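Assuming the GatedSAE sketch above, the sparsity/fidelity comparison referenced in the Research Type row reduces to measuring, on held-out activations, how many features fire per token (L0) and how well the input is reconstructed. The helper below is illustrative only; the paper's fidelity metric of "loss recovered", obtained by splicing reconstructions back into the language model, is not reproduced here.

```python
import torch


@torch.no_grad()
def l0_and_mse(sae, acts: torch.Tensor):
    """Mean number of active features per token (sparsity) and mean squared
    reconstruction error (fidelity proxy) on a batch of held-out activations."""
    x_hat, f, _ = sae(acts)
    l0 = (f > 0).float().sum(-1).mean().item()
    mse = (acts - x_hat).pow(2).sum(-1).mean().item()
    return l0, mse
```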
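The optimizer and schedule settings quoted in the Experiment Setup row can be assembled into a training loop as follows. This sketch reuses the GatedSAE class above; the warm-up length, SAE widths, sparsity coefficient, and the sample_activation_batch helper are placeholders, since those details are not fully specified in the excerpt.

```python
import torch

sae = GatedSAE(d_model=2048, d_sae=16384)        # hypothetical widths
optimizer = torch.optim.Adam(sae.parameters(), lr=3e-4, betas=(0.0, 0.999))  # beta1=0.0, beta2=0.999

warmup_steps = 1_000                             # placeholder: warm-up length not given in the excerpt
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)

sparsity_coeff = 1e-3                            # placeholder: the paper sweeps this coefficient per run
num_steps = 300_000                              # 400,000 for the Pythia-2.8B runs

for step in range(num_steps):
    # Hypothetical helper standing in for the activation-generation pipeline
    # (sequences of length 128/1024/2048 depending on the model).
    x = sample_activation_batch(batch_size=4096)
    loss = sae.loss(x, sparsity_coeff)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```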