Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Measuring and Guiding Monosemanticity
Authors: Ruben Härle, Felix Friedrich, Manuel Brack, Björn Deiseroth, Stephan Waeldchen, Patrick Schramowski, Kristian Kersting
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs. ... 1 Introduction. Large Language Models (LLMs) have become widely used due to their ability to generate coherent, contextually relevant text [6, 52, 57]. ... Our experiments demonstrate that G-SAE yields highly monosemantic concepts. ... 5 Experiments & Results. In the following, we first define our experimental setup before exhaustively evaluating monosemanticity, concept detection, and steering capabilities of SAEs including G-SAE. ... Tab. 1. This improvement is particularly pronounced for privacy (FMS@1=0.28 vs. 0.62), while improvements for toxicity are more modest (FMS@1=0.26 vs. 0.37). |
| Researcher Affiliation | Collaboration | Ruben Härle (1,2,3), Felix Friedrich (1,2,4), Manuel Brack (4,8), Stephan Wäldchen (3), Björn Deiseroth (1,2,3,4), Patrick Schramowski (1,2,4,5,6), Kristian Kersting (1,2,4,5,7). Affiliations: (1) Computer Science Department, TU Darmstadt; (2) Lab1141; (3) Aleph Alpha Research; (4) Hessian.AI; (5) German Research Center for Artificial Intelligence (DFKI); (6) CERTAIN; (7) Centre of Cognitive Science, TU Darmstadt; (8) Adobe Applied Research |
| Pseudocode | Yes | We present its pseudo-algorithm in App. Alg. 1. ... The pseudo code of the described algorithm of Sec. 3 can be seen in Alg. 1: Algorithm 1. Require: Latents L. Ensure: Ordered list of important features, accuracy trend, and indexed trees. Initialize: features ← [], accs ← [], accs_cum ← []. 1: T_0 ← tree(L) (train decision tree on L); 2: accs_cum = [acc(e) for e in T_0] (append accuracies of first tree); 3: while T_n has root and not converged(accs) do; 4: r ← root(T_n) (get root feature); 5: a ← acc(r) (measure accuracy using r); 6: features.append(r); 7: accs_n ← a; 8: remove r from L (exclude root feature); 9: T_{n+1} ← tree(L) (retrain decision tree); 10: end while; 11: return features, accs, accs_cum |
| Open Source Code | Yes | Code available at https://github.com/ml-research/measuring-and-guiding-monosemanticity |
| Open Datasets | Yes | We train (G-)SAEs on three datasets, Real Toxicity Prompts (RTP) [13], Shakespeare (SP) [23], and pii-masking-300k (PII) [2], and report both individual and aggregated results. ... [2] Ai4Privacy. pii-masking-300k, 2024. URL https://huggingface.co/datasets/ai4privacy/pii-masking-300k. [13] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. Real Toxicity Prompts: Evaluating neural toxic degeneration in language models. In Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. [23] Harsh Jhamtani, Varun Gangal, Eduard Hovy, and Eric Nyberg. Shakespearizing modern language using copy-enriched sequence-to-sequence models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017. |
| Dataset Splits | No | The paper refers to using "train dataset" and "test dataset" for SP and PII (e.g., in Appendix D.2.3), indicating that splits exist. However, it does not explicitly provide details on how these splits are formed (e.g., percentages, absolute counts, or a reference to predefined splits that are themselves fully specified within the paper or its cited sources). For example, it does not specify a train/test/validation ratio or a method for creating these splits, which would be necessary for reproduction. |
| Hardware Specification | Yes | For all experiments we used 1 Nvidia A100 80GB, except for the experiments including Llama3-70B where we used 4 Nvidia A100 80GB. |
| Software Dependencies | No | The paper mentions models like Llama3-8B-base [33] and DistilBERT [42], but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or CUDA libraries that would be required to reproduce the environment. |
| Experiment Setup | Yes | 5.1 Experimental Setup. Models. For the main experiments, we used Meta's Llama3-8B-base [33] and extracted activations x after the 3-rd or 11-th transformer block. After encoding, we set k=2048, which results in a ~9% sparse representation of the 24576 dimensional vector f. The latent dimension exceeds the hidden dimension of LLM by a factor of 6. ... D.2.4 Vanilla SAE and G-SAE. Both Vanilla SAE and G-SAE were trained for 100 epochs on the individual datasets with a batch size of 2048 and a learning rate of 1e-5. Table 5: Hyperparameters for Vanilla SAE and G-SAE Method Control Alpha Block Width |
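
The loop quoted in the Pseudocode row (train a tree, record its root feature and that feature's accuracy, remove the feature, retrain) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. It rests on several assumptions: the decision tree is replaced by a depth-1 stump chosen by accuracy, so the "root" is simply the best single-feature threshold split. The `converged` test, which the quoted pseudocode leaves unspecified, is replaced by a hypothetical `min_acc` cutoff. The `accs_cum` output is omitted. The names `best_stump` and `rank_features` and the toy data are illustrative only.

```python
from typing import List, Sequence, Tuple


def best_stump(X: Sequence[Sequence[float]], y: Sequence[int],
               active: Sequence[int]) -> Tuple[int, float]:
    """Return (feature index, accuracy) of the best single-threshold split.

    Assumption: stands in for the root of a trained decision tree.
    Returns (-1, -1.0) if no active feature admits a split.
    """
    best_f, best_acc = -1, -1.0
    n = len(y)
    for f in active:
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            correct = sum((row[f] > t) == bool(label) for row, label in zip(X, y))
            acc = max(correct, n - correct) / n  # allow either polarity
            if acc > best_acc:
                best_f, best_acc = f, acc
    return best_f, best_acc


def rank_features(X: Sequence[Sequence[float]], y: Sequence[int],
                  min_acc: float = 0.6) -> Tuple[List[int], List[float]]:
    """Iteratively pick the root feature, record its accuracy, and remove it."""
    active = list(range(len(X[0])))
    features: List[int] = []
    accs: List[float] = []
    while active:
        root, acc = best_stump(X, y, active)   # "train tree", take its root
        if root < 0 or acc < min_acc:          # no root left, or assumed convergence
            break
        features.append(root)
        accs.append(acc)
        active.remove(root)                    # exclude root feature, then retrain
    return features, accs


# Toy latents: feature 0 separates the classes, feature 1 partly, feature 2 is constant.
X = [[0.9, 0.1, 0.5],
     [0.8, 0.7, 0.5],
     [0.1, 0.2, 0.5],
     [0.2, 0.8, 0.5]]
y = [1, 1, 0, 0]

features, accs = rank_features(X, y, min_acc=0.6)
print(features, accs)  # prints: [0, 1] [1.0, 0.75]
```

On the toy data the accuracy drops from 1.0 to 0.75 once the first feature is removed, which is the kind of accuracy trend the algorithm returns: a sharp drop after the first feature suggests the concept is concentrated in a single latent.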