Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Revising and Falsifying Sparse Autoencoder Feature Explanations

Authors: George Ma, Samuel Pfrommer, Somayeh Sojoudi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our main contributions are listed below and summarized in Figure 1. 1. We provide a method for sourcing close negatives to top-activating sentences in the dataset, and show that these more effectively falsify explanations than the random sentences used in prior work. [...] 4. Through empirical analysis, we show that both the structured explanation format and the tree-based explainer improve the quality of feature explanations. Semantically similar negatives more effectively falsify explanations and reveal the recall bias in current interpretability methods. We further investigate how feature complexity and polysemanticity evolve across LLM layers. 5 Experiments In Section 5.1, we compare various methods for sourcing complementary sentences. Section 5.2 documents our improvements to the explanation generation process. Finally, Section 5.3 analyzes the impact of our structured explanations on the composition of SAE features as a function of layer.
Researcher Affiliation Academia George Ma¹, Samuel Pfrommer¹, Somayeh Sojoudi¹. ¹University of California, Berkeley. Equal contribution. Correspondence to: George Ma (EMAIL).
Pseudocode Yes The pseudocode of the tree-based explainer is shown in Algorithm 1. Algorithm 1 Tree-Based Explainer for SAE Feature Interpretation Require: Training and validation records for a single SAE feature, each containing top-activating and complementary sentences with ground-truth activations Ensure: Natural language explanation for the target SAE feature
Open Source Code Yes Code is available at https://github.com/GeorgeMLP/feature-interp.
Open Datasets Yes All experiments are conducted using an uncopyrighted subset of the Pile [8, 25]. ... [8] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027, 2020. ... [25] Monology. Pile Uncopyrighted: A Copyright-Filtered Subset of The Pile. https://huggingface.co/datasets/monology/pile-uncopyrighted, 2023. Dataset retrieved 2025-05-13.
Dataset Splits Yes All experiments are conducted using an uncopyrighted subset of the Pile [8, 25]. We conduct experiments on a subset of 100,000 sentences and chunk them into sequences with 32 tokens. For all experiments, we use the open-source Llama 4 Scout to generate explanations [22]. Our subject language models are gemma-2-9b, llama-3.1-8b, and gpt-2-small. We leveraged pretrained SAEs of comparable widths, using the 16k Gemma scope SAEs [19] and 32k Llama scope and GPT-2 SAEs [9, 13]. We use the first 50 SAE features of each layer in our experiments. In this section, we evaluate the four complementary sentence sourcing strategies introduced in Section 4.4. Following the setup of Bills et al. [1], we provide the explainer LLM with 10 top-activating sentences as the training dataset for each feature and prompt it to generate an explanation. We then use a simulator LLM to predict feature activations on a test dataset, which consists of 10 top-activating sentences and 10 complementary sentences.
Hardware Specification Yes Our experiments were conducted on 40 gigabyte A100 GPU instances with 32 CPU cores each. We simulate all our explanations using the gemma-2-27b-it model with a bilevel key-value caching scheme for efficiency (Appendix A.2). For simulating structured explanations, we perform multiple passes of the above simulation scheme, one per explanation component. Further details are deferred to Section 4.3. The small size and compression of gemma-2-27b-it allows for local simulation on a single 40 GB Nvidia A100 GPU.
Software Dependencies No The paper mentions using specific models like "gemma-2-27b-it" and "gemma-2-9b" but does not provide version numbers for general software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes Hyperparameters. For the one-shot explainer, we use a temperature of 1 and a top-p value of 1 for explanation generation, with a maximum of 10 rules (i.e., components in the structured explanation). For the tree-based explainer, we set the temperature to 1.2 and top-p to 1, with a maximum of 5 rules. The tree is initialized with 3 root nodes, a maximum depth of 2, and a branching factor of 2, meaning each node generates 2 candidate explanations after evaluation and feedback. The width is set to 2, retaining the top 2 scoring explanations at each iteration and discarding the rest.
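
The tree-search hyperparameters quoted in the Experiment Setup row can be read as a small beam-style search. The Python sketch below is only an illustration of that reading, not the authors' implementation: the names TreeConfig, tree_search, propose, and score are hypothetical, and the ordering of pruning and expansion, as well as whether the root level counts toward the depth, are assumptions. The sampling fields are recorded for reference and would be consumed by whatever explainer call backs propose.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class TreeConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    num_roots: int = 3         # initial explanations
    max_depth: int = 2         # refinement rounds after the roots (assumed reading)
    branching_factor: int = 2  # children generated per retained explanation
    width: int = 2             # explanations retained at each iteration
    temperature: float = 1.2   # sampling settings for the explainer LLM,
    top_p: float = 1.0         # unused by the search loop itself
    max_rules: int = 5


def tree_search(
    propose: Callable[[Optional[str]], str],
    score: Callable[[str], float],
    cfg: TreeConfig = TreeConfig(),
) -> str:
    """Return the best-scoring explanation found.

    propose(None) yields a fresh root explanation; propose(parent) yields a
    revision of parent. score(explanation) returns a number, higher is better.
    Both are stand-ins supplied by the caller.
    """
    # Score every root once.
    level: List[Tuple[float, str]] = []
    for _ in range(cfg.num_roots):
        root = propose(None)
        level.append((score(root), root))
    best = max(level, key=lambda pair: pair[0])

    for _depth in range(cfg.max_depth):
        # Keep the top `width` candidates and discard the rest.
        survivors = sorted(level, key=lambda pair: pair[0], reverse=True)[: cfg.width]
        level = []
        for _parent_score, parent in survivors:
            for _child in range(cfg.branching_factor):
                child = propose(parent)
                level.append((score(child), child))
        # Track the best explanation seen at any depth.
        best = max([best, *level], key=lambda pair: pair[0])

    return best[1]
```

Under these assumptions the default configuration generates 3 roots and then 2 x 2 = 4 children in each of the 2 rounds, for 11 proposals in total. In practice propose would wrap an explainer LLM call with feedback from evaluation, and score would wrap a simulator-based evaluation.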