Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

Authors: James Oldfield, Shawn Im, Sharon Li, Mihalis A. Nicolaou, Ioannis Patras, Grigorios Chrysos

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental section in the main paper is split into two parts. Section 3.1 first demonstrates how MxDs perform significantly better on the accuracy-sparsity frontier as sparse MLP layer approximations on 4 LLMs. We then demonstrate in Section 3.2 that MxD's features retain the same levels of specialization through sparse probing and steering evaluations.
Researcher Affiliation | Academia | James Oldfield (m, q), Shawn Im (m), Sharon Li (m), Mihalis A. Nicolaou (c, i), Ioannis Patras (q), Grigorios G. Chrysos (m). Affiliations: m = University of Wisconsin-Madison; q = Queen Mary University of London; c = University of Cyprus; i = The Cyprus Institute.
Pseudocode | No | The paper includes mathematical formulas and theoretical derivations in Section 2 (Methodology) and Appendix A (Proofs and additional technical results), but does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our code is included at: https://github.com/james-oldfield/MxD/.
Open Datasets | Yes | We train all sparse layers on a total of 480M tokens of OpenWebText [42]... The details of the datasets used are summarized in Table 7.
    Dataset | # Training examples | # Test examples | Classification task description | Number of classes
    fancyzhx/ag_news [48] | 16,000 | 4,000 | News article topic | 4
    codeparrot/github-code [105] | 20,000 | 5,000 | Programming language | 5
    amazon_reviews_mcauley_1and5_sentiment [106] | 8,000 | 2,000 | Positive/negative review sentiment | 2
    Helsinki-NLP/europarl [107] | 20,000 | 5,000 | European language | 5
    LabHC/bias_in_bios [108] | 32,000 | 8,000 | Profession from bio | 8
Dataset Splits | Yes | We train all sparse layers on a total of 480M tokens of OpenWebText [42]... For sample-level probing, we truncate the input strings to the first 128 tokens for all datasets but for the Github dataset, where we take the last 128 tokens to avoid license headers [19, 49]. For token-level probing, we instead take only the last 128 tokens, where the final token contains the surname of the individual in question in the datasets of [49]. Binary probes are trained on 80% of the training data (randomly shuffled) with the sklearn library's LogisticRegression module... A random seed of 42 is used throughout the code to ensure reproducibility.
Hardware Specification | Yes | Table 10: Total training time and resources used to produce the k = 32 experiments (the required compute being roughly the same across models trained with different k).
    Model | GPU used | VRAM | Training time | d_in | mlp_expansion_factor | Asset link
    GPT2-124m | x1 GeForce RTX 3090 | 24GB | 8h 34m 37s | 768 | 32 | https://huggingface.co/docs/transformers/en/model_doc/gpt2
    Pythia-410m | x1 GeForce RTX 3090 | 24GB | 8h 35m 17s | 1024 | 32 | https://huggingface.co/EleutherAI/pythia-410m
    Pythia-1.4B | x1 A100 | 80GB | 23h 25m 23s | 2048 | 32 | https://huggingface.co/EleutherAI/pythia-1.4b
    Llama-3.2-3B | x1 A100 | 80GB | 2d 3m 50s | 3072 | 32 | https://huggingface.co/meta-llama/Llama-3.2-3B
Software Dependencies | No | We include a notebook at https://github.com/james-oldfield/MxD/blob/main/form-equivalence.ipynb showing the equivalence in PyTorch. Binary probes are trained on 80% of the training data (randomly shuffled) with the sklearn library's LogisticRegression module with parameters...
Experiment Setup | Yes | Implementation details: We train on 4 base models: GPT2-124M [3], Pythia-410m, Pythia-1.4b [41], and Llama-3.2-3B [1] with up to 80k experts/features. We train all sparse layers on a total of 480M tokens of OpenWebText [42], with learning rate 1e-4 and a context length of 128, initializing the output bias as the empirical mean of the training tokens, and D in MxDs as the zero-matrix (following [26]). We vary N in MxD layers to parameter-match Transcoders in all experiments, with parameter counts and dimensions shown in Table 2. For Llama3.2-3B, we use the Swish-GLU variant of MxD and GELU-MLP MxDs for the other three models, matching the architectures of their base encoders. Through ablation studies in Appendix B.8 we show that MxDs using the GELU/GLU variants are much more accurate layer approximators than the ReLU variants. Full experimental details are included in Appendix D.
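
The truncation and split rules quoted under Dataset Splits can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' code: the function names `truncate_tokens` and `split_probe_data` are hypothetical, matching the GitHub dataset by the substring "github" is an assumption, and Python's `random` module stands in for whatever shuffling routine the released code uses. The quoted text also does not say how the remaining 20% of the training data is used.

```python
import random

MAX_TOKENS = 128  # truncation length quoted in the report
SEED = 42         # random seed quoted in the report


def truncate_tokens(tokens, dataset_name, level="sample"):
    """Keep at most MAX_TOKENS tokens, following the quoted truncation rule.

    Sample-level probing keeps the first 128 tokens, except for the GitHub
    dataset, which keeps the last 128 to avoid license headers.
    Token-level probing always keeps the last 128 tokens.
    """
    # Assumption: the GitHub dataset is recognised by its name.
    keep_last = level == "token" or "github" in dataset_name.lower()
    return tokens[-MAX_TOKENS:] if keep_last else tokens[:MAX_TOKENS]


def split_probe_data(examples, train_fraction=0.8, seed=SEED):
    """Shuffle a copy of the examples and split off the probe-training part."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

For example, `split_probe_data(range(10))` returns 8 probe-training items and 2 remaining items, and repeated calls give the same partition because the shuffle is seeded.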