Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Authors: Denis Sutter, Julian Minder, Thomas Hofmann, Tiago Pimentel
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., on an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. |
| Researcher Affiliation | Academia | EPFL |
| Pseudocode | Yes | Appendix B, "Pseudo-code for Running an Intervention on an Algorithm": In Fig. 5, we present pseudo-code that demonstrates how algorithms execute under our intervention framework. [Listing, partially lost in extraction: `f(A)` defines and returns `f_A(x, I_A=None)`, which loops over `A.topological_sort(...)` of the inner and output nodes, sets `v_η = I_A[η]` when `η` is in `I_A`, otherwise computes `v_η` from the values of its parents `par_A(η)`, and returns `v_ηy`.] Figure 5: Pseudo-code implementation of an algorithm with interventions, where interventions I_A are specified as a Python dictionary mapping nodes to their intervened values. |
| Open Source Code | Yes | We provide the code to reproduce our experiments in https://github.com/densutter/non-linear-representation-dilemma. Refer to the README.md for instructions. |
| Open Datasets | Yes | Indirect object identification (IOI) task. In a second set of experiments, we explore this task, inspired by Wang et al. (2023) and using the dataset of Muhia (2022). |
| Dataset Splits | Yes | For the hierarchical equality task, we use a 3-layer MLP with \|ψ1\| = \|ψ2\| = \|ψ3\| = 16. The model is trained using the Adam optimiser with learning rate 0.001 and cross-entropy loss. We use a batch size of 1024 and train on 1,048,576 samples, with 10,000 samples each for evaluation and testing. Training runs for a maximum of 20 epochs with early stopping after 3 epochs of no improvement. For the training progression experiments, we use the same configuration but limit training to 2 epochs. When training the alignment maps ϕ, we use a batch size of 6400 and train for up to 50 epochs with early stopping after 5 epochs of no improvement (using a threshold of 0.001 for the required change, compared to 0 for MLP training). We use the Adam optimiser with learning rate 0.001 and cross-entropy loss. To generate the datasets for DAS, for Alg. 1 we intervene with a probability of 1/3 on η_{x1==x2}, 1/3 on η_{x3==x4}, and 1/3 on both variables. The samples for the base and source inputs are generated such that (x1 == x2) and (x3 == x4) each hold 50% of the time. For Alg. 2 and Alg. 3 we intervene on η_{x1==x2} and η_{x1} for all samples, respectively. For each algorithm, we sample 1,280,000 interventions for training, 10,000 for evaluation, and 10,000 for testing. |
| Hardware Specification | Yes | The experiments on MLP were executed on CPU (10 computers with i7-4770 or newer) over 3 weeks, as we noticed that DAS on small MLPs are faster on CPU than on GPU. The experiments on the Pythia models were executed on a single A100 GPU with 80GB of memory using approximately 30 GPU hours, including the hyperparameter tuning. |
| Software Dependencies | No | The paper mentions PyTorch models but does not specify exact version numbers for PyTorch or any other software libraries used. It refers to 'Python' generally but lacks a specific version. |
| Experiment Setup | Yes | For the hierarchical equality task, we use a 3-layer MLP with \|ψ1\| = \|ψ2\| = \|ψ3\| = 16. The model is trained using the Adam optimiser with learning rate 0.001 and cross-entropy loss. We use a batch size of 1024 and train on 1,048,576 samples, with 10,000 samples each for evaluation and testing. Training runs for a maximum of 20 epochs with early stopping after 3 epochs of no improvement. |
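
The Pseudocode row describes running an algorithm under interventions given as a Python dictionary from nodes to intervened values. The sketch below is one runnable reading of that description, not the authors' code. The `Algorithm` container, its field names, the node names `eq12`, `eq34`, and `y`, and the choice of output rule `y = (eq12 == eq34)` for the hierarchical equality task are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Algorithm:
    """Hypothetical container for a causal DAG (not from the paper's code)."""

    def __init__(self, input_nodes, mechanisms, output_node):
        self.input_nodes = input_nodes    # ordered names of the input nodes
        self.mechanisms = mechanisms      # node -> (parents, function of parent values)
        self.output_node = output_node

    def topological_sort(self):
        # Map each non-input node to its parents, then order parents before children.
        graph = {node: set(parents) for node, (parents, _) in self.mechanisms.items()}
        order = TopologicalSorter(graph).static_order()
        # Keep only inner and output nodes; inputs are set directly from x.
        return [node for node in order if node in self.mechanisms]


def f(A):
    """Return a function that runs algorithm A, optionally under interventions."""

    def f_A(x, I_A=None):
        values = dict(zip(A.input_nodes, x))  # assumed step: bind inputs to x
        for node in A.topological_sort():
            if I_A and node in I_A:
                values[node] = I_A[node]      # intervention overrides the mechanism
            else:
                parents, fn = A.mechanisms[node]
                values[node] = fn(*(values[p] for p in parents))
        return values[A.output_node]

    return f_A


# Assumed form of the hierarchical equality algorithm with two inner nodes.
hier_eq = Algorithm(
    input_nodes=("x1", "x2", "x3", "x4"),
    mechanisms={
        "eq12": (("x1", "x2"), lambda a, b: a == b),
        "eq34": (("x3", "x4"), lambda a, b: a == b),
        "y": (("eq12", "eq34"), lambda a, b: a == b),
    },
    output_node="y",
)
run = f(hier_eq)

base = (1, 1, 2, 3)                           # eq12 is True, eq34 is False
plain_output = run(base)
intervened_output = run(base, {"eq34": True})  # force eq34 to a source value
```

Here `plain_output` is the unmodified result on `base`, while `intervened_output` replaces the value of `eq34` before the output node is computed, which is the operation an interchange intervention performs with a value taken from a source input.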