Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders

Authors: Mengyu Ye, Jun Suzuki, Tatsuro Inaba, Tatsuki Kuribayashi

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our extensive evaluation revealed that SAEs and FFs exhibit a similar range of interpretability, although SAEs displayed an observable but minimal improvement in some aspects. Furthermore, in certain aspects, surprisingly, even vanilla FFs yielded better interpretability than the SAEs, and features discovered in SAEs and FFs diverged. These bring questions about the advantage of SAEs from both perspectives of feature quality and faithfulness, compared to directly interpreting FF feature vectors, and FF key-value parameters serve as a strong baseline in modern interpretability research.
Researcher Affiliation Academia Mengyu Ye Tohoku University EMAIL Jun Suzuki Tohoku University & RIKEN EMAIL Tatsuro Inaba MBZUAI EMAIL Tatsuki Kuribayashi MBZUAI EMAIL
Pseudocode No The paper describes methods using mathematical equations (e.g., Equation 1: SAE reconstruction, Equation 2: FF layer operation, Equation 3: Top K FF-KV) and natural language, but does not include any clearly labeled pseudocode or algorithm blocks. Appendix A provides implementation details in prose.
Open Source Code Yes We also release the code at https://github.com/muyo8692/ff-kv-sae to facilitate maximum reproducibility of our main results.
Open Datasets Yes The evaluation is conducted on a 4 M-token subset of Open Web Text [19]. The dataset is a subset of the copyright-free version of the Pile [15] (monology/pile-uncopyrighted), comprising 2 M tokens. Spurious Correlation Removal (SCR) evaluates how well two spuriously correlated features (e.g., gender and profession) are disentangled from different features, using the SHIFT [32] data.
Dataset Splits Yes The evaluation is conducted on a 4 M-token subset of Open Web Text [19]. Explained Variance... computed on a 0.4 M-token subset of Open Web Text [19]. Auto-Interpretation Score... The dataset is a subset of the copyright-free version of the Pile [15] (monology/pile-uncopyrighted), comprising 2 M tokens. ... The scoring phase constructs a test set for each feature containing 14 examples, exactly two of which are activated texts. SCR and TPP consider top-K activations in evaluations (we adopt K = 20 here, following existing studies [27]).
Hardware Specification Yes Most experiments presented in this paper were run on a cluster consisting of the NVIDIA H200 GPUs with 141GB of memory.
Software Dependencies No The paper references various tools and frameworks used for the experiments, such as SAEBENCH [27], SAELens [7], and pretrained SAEs from Gemma Scope [30] and Llama Scope [21], and GPT-4o [35] for auto-interpretation. However, it does not explicitly state specific version numbers for these software dependencies (e.g., Python, PyTorch, or specific versions of the mentioned libraries).
Experiment Setup Yes In our main experiments, we set k = 10 for the Top K FF-KV, and we additionally compare results under varying k values. SCR and TPP consider top-K activations in evaluations (we adopt K = 20 here, following existing studies [27]). LMs. We evaluate FFs in all layers of five LMs: Gemma-2-2B, Gemma-2-9B [43], Llama-3.1-8B [45], GPT-2 [39], and Pythia-70M [4]. SAEs. We use pretrained SAEs from Gemma Scope [30] (width 16k) for Gemma-2-2B and Gemma-2-9B, Llama Scope [21] (width 32k) for Llama-3.1-8B, and SAELens [7] for GPT-2. All SAEs are trained on FF outputs.
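
The Top K FF-KV setting mentioned in the Experiment Setup row can be illustrated with a small sketch. This is not the authors' implementation (their code is in the linked repository). It assumes that Top-K keeps the k largest feed-forward activations by value and zeroes the rest, and that the layer output is the activation-weighted sum of value vectors; the paper's Equation 3 may define the selection differently. The function names `top_k_sparsify` and `ff_output` and the toy numbers are hypothetical.

```python
import heapq
from typing import List


def top_k_sparsify(activations: List[float], k: int) -> List[float]:
    """Keep the k largest activations and set all others to zero.

    Assumption: selection is by raw activation value; ties go to the lower index.
    """
    if k <= 0:
        return [0.0] * len(activations)
    # Indices of the k largest activations.
    keep = set(
        heapq.nlargest(k, range(len(activations)), key=lambda i: (activations[i], -i))
    )
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


def ff_output(activations: List[float], values: List[List[float]]) -> List[float]:
    """Activation-weighted sum of value vectors: sum_i a_i * v_i."""
    dim = len(values[0])
    out = [0.0] * dim
    for a, v in zip(activations, values):
        if a != 0.0:  # skipped entries contribute nothing
            for j in range(dim):
                out[j] += a * v[j]
    return out


# Toy example: 5 hypothetical key-value slots, 2-dimensional value vectors.
acts = [0.1, 2.0, 0.0, 1.5, 0.3]
vals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]

sparse = top_k_sparsify(acts, k=2)   # only slots 1 and 3 stay active
out = ff_output(sparse, vals)        # 2.0 * [0, 1] + 1.5 * [2, 0]
print(sparse, out)
```

With k = 2 only two slots contribute to the output, so each output can be attributed to a small set of value vectors; the paper's main experiments use k = 10.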