Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Transferring Linear Features Across Language Models With Model Stitching

Authors: Alan Chen, Jack Merullo, Alessandro Stolfo, Ellie Pavlick

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we demonstrate that affine mappings between residual streams of language models is a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring to a larger model at a FLOPs savings. In particular, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped.
Researcher Affiliation | Collaboration | Alan Chen (Brown University), Jack Merullo (Goodfire), Alessandro Stolfo (ETH Zürich), Ellie Pavlick (Brown University)
Pseudocode | No | The paper describes methods using mathematical equations and structured prose (e.g., Section 2, Equation 3 for the loss function, Section 3.1, Equation 6 for SAE transfer), but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | The data and code for our experiments are either 1) open source and attributed credit as relevant or 2) included in the supplemental material.
Open Datasets | Yes | For the training dataset, we collect activations at the desired layers in both models over the first 180k samples of OpenWebText with a context size of 512 (128 for Gemma due to compute constraints) and evaluate over the next 1k samples with the same context size. We mask out all special tokens. We use the Gemma model pair and present the results in Figure 3b, where we plot the % of responses in the target language depending on the type of intervention: no steering, steering with a vector learned from the current model, and steering with a transferred vector learned from the other model. Averaged over L, we find that the transferred steering vector identifies a direction that successfully steers the model toward responding in the target language without explicit prompt instruction to do so. We break down the overall accuracy into individual language pairs (en, L) and find the transferred steering vector works well for some languages but not for others by defining a clipped relative transfer gap as the ratio of transfer steering performance to ground truth steering performance clipped to [0, 1] for visualization (Figure 3c). We also note a positive correlation between language frequency and steering transfer effectiveness (§F.1). We evaluate over a subset of 163 prompts from the IFEval dataset with the instructions stripped and aggregate the proportion of responses in the target language [Zhou et al., 2023, Stolfo et al., 2025].
Dataset Splits | Yes | For the training dataset, we collect activations at the desired layers in both models over the first 180k samples of OpenWebText with a context size of 512 (128 for Gemma due to compute constraints) and evaluate over the next 1k samples with the same context size. We evaluate over a subset of 163 prompts from the IFEval dataset with the instructions stripped and aggregate the proportion of responses in the target language [Zhou et al., 2023, Stolfo et al., 2025]. We compute a steering vector over the first 100 paired examples in each en-L dataset by unit-normalizing the difference in mean activation on L tokens vs. English tokens [Panickssery et al., 2023].
Hardware Specification | Yes | All experiments were run on single Quadro 6000 / RTX3090 (both have 24GB VRAM) configurations.
Software Dependencies | No | The paper mentions using 'SAELens', the 'Adam optimizer', and 'torch.utils.flop_counter.FlopCounterMode' (implying PyTorch), but does not provide specific version numbers for any of these software components.
Experiment Setup | Yes | For the training dataset, we collect activations at the desired layers in both models over the first 180k samples of OpenWebText with a context size of 512 (128 for Gemma due to compute constraints) and evaluate over the next 1k samples with the same context size. We use the Adam optimizer with a learning rate of 1e-4 and clip gradient norms to 1.0. We found minimal sensitivity to learning rate schedule, so we just use a cosine annealing decay, and found that 2 epochs is sufficient for convergence (though even 1 is probably enough). All SAEs we train are TopK SAEs trained using SAELens on unnormalized residual stream activations. We train SAEs with latent sizes 4096, 8192, 16384, 32768, and 65536. We abide by the following practices: 1. We normalize the decoder vectors to unit norm each iteration. 2. When randomly initializing, we initialize the decoder and encoder as transposes of each other. 3. We use a constant learning rate schedule and just use 0.0001 as the learning rate. 4. We do not use an auxiliary loss for ease of FLOPs estimation (discussed below). All SAEs are trained with sparsity k = 64 and width 32k. A full run is 4B tokens (120k iterations) for the SAEs and 200M tokens (36k iterations) for the stitch. We prompt gemma-2-9b-it 5 times with temperature 1.0 using the following prompt (generated using GPT-4 and mildly edited), resulting in 6 versions of the same sentence.
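
Illustrative sketch (not taken from the paper's code): two quantities quoted in the table are simple enough to write out. These are the steering vector, described as the unit-normalized difference in mean activation on target-language tokens vs. English tokens, and the clipped relative transfer gap, described as the ratio of transferred to ground-truth steering performance clipped to [0, 1]. The function names, the use of plain Python lists for activations, and the handling of a zero ground-truth score are all assumptions made here for illustration only.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(target_acts, english_acts):
    """Unit-normalized difference of mean activations (target minus English).

    Both arguments are lists of activation vectors (hypothetical layout:
    one vector per token, all of the same dimension).
    """
    mu_target = mean_vector(target_acts)
    mu_english = mean_vector(english_acts)
    diff = [t - e for t, e in zip(mu_target, mu_english)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("mean activations are identical; direction undefined")
    return [d / norm for d in diff]


def clipped_relative_transfer_gap(transfer_score, ground_truth_score):
    """Ratio of transferred to ground-truth performance, clipped to [0, 1].

    Assumption (not from the source): a non-positive ground-truth score
    maps to 0.0 to avoid division by zero.
    """
    if ground_truth_score <= 0.0:
        return 0.0
    ratio = transfer_score / ground_truth_score
    return min(1.0, max(0.0, ratio))
```

In this sketch a transferred vector that outperforms the ground-truth vector is reported as 1.0 rather than a value above 1, which matches the stated purpose of clipping for visualization.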