Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

Authors: Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. We apply autointerpretability methods to compare interpretability between the crosscoders. In Figure 4, we plot the KL divergence for different experiments on 512 chat interactions, with user requests from Ding et al.'s [Ding et al., 2023] dataset and responses generated by the chat model.
Researcher Affiliation | Academia | Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda. Affiliations listed: EPFL, ETHZ, Ecole Normale Supérieure Paris-Saclay, Université Paris-Saclay, Northeastern University. EMAIL, EMAIL
Pseudocode | No | The paper describes methods and mathematical formulations but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We open-source our code, training library, models, wandb runs and a demo notebook to explore latents. We provide open access to the data and code in the supplemental material ?? and Appendix K. Access to the crosscoder models will be provided upon deanonymization.
Open Datasets | Yes | User requests from Ding et al.'s [Ding et al., 2023] dataset. Training Data: 100M tokens from FineWeb (web data; ODC-By v1.0 License) [Penedo et al., 2023] and lmsys-chat (chat data; Custom License) [Zheng et al., 2024], respectively.
Dataset Splits | Yes | We compute the reconstruction and error ratios ($\nu^{r}_{j}$ and $\nu^{\varepsilon}_{j}$) for all L1 crosscoder chat-only latents on 50M tokens from the training set. In Figure 4, we plot the KL divergence for different experiments on 512 chat interactions. ... on the training set $D_{\text{train}}$ (containing data from LMSYS and FineWeb). ... activation in the validation set (referred to as "dead" latents).
Hardware Specification | Yes | All of the experiments in this paper can be reproduced in approximately 180 GPU/h of NVIDIA H100 GPUs. Crosscoder Training: 10h on an A100 per crosscoder.
Software Dependencies | No | We use the tools nnsight (MIT License) [Fiotto-Kaufman et al., 2024] and a branch of dictionary_learning (MIT License) [Marks et al., 2024] to train the crosscoder. Specifically, we load the model using the transformers library from Wolf et al. [2020]. The paper lists software but does not specify version numbers for these components.
Experiment Setup | Yes | Base Model: Gemma 2 2B. Chat Model: Gemma 2 2B it. Layer used: 13 (of 26). Expansion factor: 32, resulting in 73728 latents. Initialization: Decoder initialized as the transpose of the encoder weights. Encoder and decoder for both models are paired with the same initial weights. The L1 crosscoder is initialized to have a norm of 0.05 while the BatchTopK crosscoder is initialized to have a norm of 1.0. Refer to Table 2 and Table 3 for Learning Rate, µ, and k values.
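
The Experiment Setup entry reports an expansion factor of 32 yielding 73728 latents, and describes an initialization in which the decoder is the transpose of the encoder and both models start from the same weights. The following is a minimal sketch of that arithmetic and initialization, not the authors' implementation. It rests on stated assumptions: the activation width of 2304 is inferred from 73728 / 32, the "norm" is taken to mean the L2 norm of each latent's decoder vector, and the names `num_latents` and `init_crosscoder` are hypothetical.

```python
import math
import random

# Assumption: activation width 2304, inferred from 73728 latents / expansion factor 32.
D_MODEL = 2304
EXPANSION_FACTOR = 32


def num_latents(d_model, expansion_factor):
    """Dictionary size is the activation width times the expansion factor."""
    return d_model * expansion_factor


def init_crosscoder(d_model, n_latents, decoder_norm, n_models=2, seed=0):
    """Hypothetical helper: transpose-tied initialization shared across models.

    Returns (encoders, decoders), each a list with one matrix per model.
    Decoder matrices are n_latents x d_model, encoder matrices are
    d_model x n_latents. Assumption: decoder_norm is the L2 norm of each
    latent's decoder vector (0.05 for L1, 1.0 for BatchTopK in the source).
    """
    rng = random.Random(seed)
    decoder = []
    for _ in range(n_latents):
        row = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
        scale = decoder_norm / math.sqrt(sum(x * x for x in row))
        decoder.append([x * scale for x in row])
    # The encoder is the transpose of the decoder (equivalently, the decoder
    # is the transpose of the encoder).
    encoder = [list(col) for col in zip(*decoder)]
    # Both models (base and chat) receive independent copies of the same weights.
    encoders = [[r[:] for r in encoder] for _ in range(n_models)]
    decoders = [[r[:] for r in decoder] for _ in range(n_models)]
    return encoders, decoders


if __name__ == "__main__":
    print(num_latents(D_MODEL, EXPANSION_FACTOR))
    # Tiny dimensions keep the demonstration fast; real sizes would be
    # D_MODEL and num_latents(D_MODEL, EXPANSION_FACTOR).
    encoders, decoders = init_crosscoder(d_model=8, n_latents=16, decoder_norm=0.05)
    print(len(encoders[0]), len(encoders[0][0]))
```

In practice this would be done with tensor operations in the training library; the pure-Python lists here only make the shapes and the weight tying explicit.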