Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
Dense SAE Latents Are Features, Not Bugs
Authors: Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs, suggesting that high-density features are an intrinsic property of the residual space. |
| Researcher Affiliation | Academia | Xiaoqing Sun (MIT), Alessandro Stolfo (ETH Zürich), Joshua Engels (MIT), Ben Wu (University of Sheffield), Senthooran Rajamanoharan, Mrinmaya Sachan (ETH Zürich), Max Tegmark (MIT) |
| Pseudocode | No | The paper includes mathematical equations (1) and (2) but does not present any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | Justification: We plan to upload our code by the supplementary materials deadline. |
| Open Datasets | Yes | Experimental Setup. We focus our investigation on the Gemma Scope SAEs [Lieberum et al., 2024] trained on Gemma 2 2B [Gemma Team, 2024], which use a JumpReLU activation function [Rajamanoharan et al., 2024b]. We additionally train TopK SAEs [Gao et al., 2025] on 1B tokens of the OpenWebText corpus [Gokaslan and Cohen, 2019] for our experiments in Section 3.1. Activation densities for Gemma Scope latents are from Neuronpedia [Lin, 2023], while densities for our TopK SAEs are computed over 100M tokens from the C4 corpus [Raffel et al., 2020]. Full experimental details are in Appendix B. Meaningful-Word Latents. The next class of latents that we investigate are those whose firing can be well predicted by the part-of-speech (POS) tag of the token. We create a reduced set of high-level tags from the Brown Corpus [Francis and Kučera, 1979] by combining similar tags (e.g., combining plural and singular forms of nouns), and capture dense latent activations on 10k sentences (~200k tokens) from the corpus. |
| Dataset Splits | No | The paper mentions training on 1B tokens from OpenWebText, computing densities over 100M tokens from the C4 corpus, capturing projections for 5,000 contexts of 1,024 tokens each, and using 10k sentences from the Brown Corpus. However, it does not specify explicit training/validation/test splits or percentages for reproducing the general experimental setup. |
| Hardware Specification | Yes | We expect the experiments for training SAEs, capturing SAE activations, and generating completions with Gemma 2 2B to run in about 30 A6000 hours. |
| Software Dependencies | No | All experiments were implemented in PyTorch [Paszke et al., 2019], with model inspection tools from the TransformerLens library [Nanda and Bloom, 2022]. Data processing used NumPy [Harris et al., 2020] and Pandas [Wes McKinney, 2010], and figures were generated with Plotly [Plotly Technologies Inc., 2015]. |
| Experiment Setup | Yes | Experimental Setup. We focus our investigation on the Gemma Scope SAEs [Lieberum et al., 2024] trained on Gemma 2 2B [Gemma Team, 2024], which use a JumpReLU activation function [Rajamanoharan et al., 2024b]. We additionally train TopK SAEs [Gao et al., 2025] on 1B tokens of the OpenWebText corpus [Gokaslan and Cohen, 2019] for our experiments in Section 3.1. Activation densities for Gemma Scope latents are from Neuronpedia [Lin, 2023], while densities for our TopK SAEs are computed over 100M tokens from the C4 corpus [Raffel et al., 2020]. Full experimental details are in Appendix B. Appendix B: For the experiment in Section 3.1, we trained TopK SAEs [Gao et al., 2025] on the residual stream activations at layer 25 of Gemma 2 2B using 1 billion tokens from the OpenWebText corpus [Gokaslan and Cohen, 2019]. Training followed the default configuration of the Sparsify library, and experiment tracking was conducted using Weights & Biases. |
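The Research Type row reports that dense latents tend to form antipodal pairs. As a minimal sketch of how such pairs could be detected, assuming the comparison is made between decoder directions (rows of a hypothetical `W_dec` array of shape `(n_latents, d_model)`) and using a hypothetical cosine cutoff of -0.9, one could write the NumPy code below. It is an illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

import numpy as np


def antipodal_pairs(
    W_dec: np.ndarray, latent_ids: Sequence[int], max_cos: float = -0.9
) -> List[Tuple[int, int, float]]:
    """Find latent pairs whose decoder directions point in nearly opposite directions.

    W_dec: (n_latents, d_model) decoder matrix (hypothetical layout: one row per latent).
    latent_ids: latents to compare, e.g. those flagged as dense.
    max_cos: hypothetical cutoff; a pair qualifies if cosine similarity <= max_cos.
    """
    ids = np.asarray(latent_ids)
    dirs = W_dec[ids]
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)  # unit-normalize rows
    cos = dirs @ dirs.T  # pairwise cosine similarities
    pairs = []
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            if cos[a, b] <= max_cos:
                pairs.append((int(ids[a]), int(ids[b]), float(cos[a, b])))
    return pairs
```

For a few hundred dense latents the double loop is cheap; for larger sets, `np.argwhere(np.triu(cos <= max_cos, k=1))` yields the same index pairs without Python-level loops.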
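The Research Type row states that ablating the dense latents' subspace suppresses new dense features in retrained SAEs. One plausible reading, sketched below under that assumption, is projecting each residual-stream activation onto the orthogonal complement of the span of the dense latents' decoder directions. The helper name `ablate_subspace` and the tolerance `tol` are illustrative, not from the paper. An SVD is used rather than QR because an antipodal pair contributes two linearly dependent directions, and a rank-aware basis avoids removing an extra, arbitrary direction.

```python
import numpy as np


def ablate_subspace(acts: np.ndarray, directions: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Remove from each activation its component in span(directions).

    acts: (n_tokens, d_model) residual-stream activations.
    directions: (n_dirs, d_model), e.g. decoder rows of dense latents.
    tol: relative singular-value cutoff used to drop linearly dependent directions.
    """
    directions = np.atleast_2d(directions)
    if directions.size == 0:
        return acts.copy()
    # Left singular vectors give an orthonormal basis for the column space of directions.T.
    u, s, _ = np.linalg.svd(directions.T, full_matrices=False)
    rank = int(np.sum(s > tol * s.max()))
    basis = u[:, :rank]  # (d_model, rank)
    # Subtract the orthogonal projection of each row onto the subspace.
    return acts - (acts @ basis) @ basis.T
```

After ablation, every returned row is orthogonal to each input direction, so an SAE retrained on these activations cannot reuse that subspace.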
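The Open Datasets and Experiment Setup rows name two SAE activation functions: JumpReLU (used by Gemma Scope) and TopK (used for the SAEs trained on OpenWebText). A minimal NumPy sketch of each, operating on precomputed encoder pre-activations, follows. The function names are hypothetical, and applying ReLU after the TopK selection is an assumption of this sketch; the Sparsify library's exact convention should be checked before relying on it.

```python
import numpy as np


def jumprelu(pre_acts: np.ndarray, threshold: np.ndarray) -> np.ndarray:
    """JumpReLU: keep z where z > threshold, otherwise output 0.

    `threshold` holds one learned value per latent and broadcasts over rows.
    """
    return np.where(pre_acts > threshold, pre_acts, 0.0)


def topk_activation(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """TopK: keep the k largest pre-activations in each row, zero the rest.

    ReLU is applied afterwards so kept values are non-negative (an assumption
    of this sketch, not necessarily the training library's convention).
    """
    # Indices of the k largest entries per row (in no particular order).
    idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre_acts)
    np.put_along_axis(out, idx, np.take_along_axis(pre_acts, idx, axis=-1), axis=-1)
    return np.maximum(out, 0.0)


z = np.array([[0.5, -1.0, 2.0, 0.1]])
theta = np.array([0.2, 0.2, 1.0, 0.2])
print(jumprelu(z, theta).tolist())       # [[0.5, 0.0, 2.0, 0.0]]
print(topk_activation(z, k=2).tolist())  # [[0.5, 0.0, 2.0, 0.0]]
```

The design difference is that JumpReLU's sparsity varies per token (set by learned thresholds), while TopK fixes the number of active latents per token at k.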
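Activation density, as used in the Open Datasets row, can be read as the fraction of tokens on which a latent is active. The sketch below, an assumption about the definition rather than the paper's code, accumulates nonzero counts batch by batch so that a large token set (the row cites 100M C4 tokens) never has to fit in memory. `DENSE_THRESHOLD = 0.1` and the helper names are hypothetical and not taken from the paper.

```python
from typing import Iterable

import numpy as np


def activation_density(batches: Iterable[np.ndarray]) -> np.ndarray:
    """Fraction of tokens on which each SAE latent is active (nonzero).

    Each batch is an (n_tokens, n_latents) array of SAE activations.
    """
    fire_counts = None
    n_tokens = 0
    for acts in batches:
        counts = np.count_nonzero(acts, axis=0)  # per-latent firing count in this batch
        fire_counts = counts if fire_counts is None else fire_counts + counts
        n_tokens += acts.shape[0]
    if fire_counts is None:
        raise ValueError("no batches provided")
    return fire_counts / n_tokens


# Hypothetical cutoff for calling a latent "dense"; not a value from the paper.
DENSE_THRESHOLD = 0.1


def dense_latent_ids(density: np.ndarray, threshold: float = DENSE_THRESHOLD) -> np.ndarray:
    """Indices of latents whose density exceeds the threshold."""
    return np.flatnonzero(density > threshold)
```

Passing a generator that yields SAE activations for successive token chunks keeps peak memory at one batch.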
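The Open Datasets row describes collapsing Brown Corpus part-of-speech tags into a smaller set of high-level tags, for example merging singular and plural nouns. A minimal sketch of such a mapping follows; the coarse groups and the name `coarse_tag` are hypothetical, and the paper's actual grouping may differ. It strips Brown's hyphenated modifiers (such as `-TL` for words in titles) and then matches the remaining tag by prefix.

```python
# Hypothetical coarse groups; the paper's actual tag grouping may differ.
COARSE_PREFIXES = [
    ("NN", "NOUN"),   # NN, NNS, NN$, NNS$: common nouns, singular and plural
    ("NP", "PROPN"),  # NP, NPS: proper nouns
    ("VB", "VERB"),   # VB, VBD, VBG, VBN, VBZ
    ("JJ", "ADJ"),    # JJ, JJR, JJS
    ("RB", "ADV"),    # RB, RBR, RBT
    ("IN", "ADP"),    # prepositions
    ("AT", "DET"),    # articles
    ("CC", "CONJ"),   # coordinating conjunctions
    ("PP", "PRON"),   # PPS, PPO, PP$, ...: personal pronouns
]


def coarse_tag(brown_tag: str) -> str:
    """Map a fine-grained Brown tag to a coarse tag, or "OTHER" if unmatched."""
    # Drop hyphenated modifiers such as "-TL" (title) or "-HL" (headline).
    base = brown_tag.split("-")[0]
    for prefix, coarse in COARSE_PREFIXES:
        if base.startswith(prefix):
            return coarse
    return "OTHER"
```

Grouping by prefix keeps the mapping short, at the cost of being coarse for Brown's many special-case tags (for example, forms of "be" fall into `OTHER` here).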