Interpretability Illusions in the Generalization of Simplified Models

Authors: Dan Friedman, Andrew Kyle Lampinen, Lucas Dixon, Danqi Chen, Asma Ghandeharioun

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate this by training Transformer models on controlled datasets with systematic generalization splits, including the Dyck balanced-parenthesis languages and a code completion task. We simplify these models using tools like dimensionality reduction and clustering, and then explicitly test how these simplified proxies match the behavior of the original model. We find consistent generalization gaps: cases in which the simplified proxies are more faithful to the original model on the in-distribution evaluations and less faithful on various tests of systematic generalization. (An illustrative faithfulness sketch follows the table.)
Researcher Affiliation | Collaboration | Dan Friedman 1*, Andrew Lampinen 2, Lucas Dixon 3, Danqi Chen 1 2, Asma Ghandeharioun 3. *Work done while the author was a Student Researcher at Google Research. 1 Department of Computer Science, Princeton University; 2 Google DeepMind; 3 Google Research.
Pseudocode | No | No pseudocode or clearly labeled algorithm block found.
Open Source Code | No | No explicit statement about releasing source code for the described methodology, and no direct link to a code repository.
Open Datasets | Yes | For our main analysis, we train models on Dyck-(20, 10), the language with 20 bracket types and a maximum depth of 10, following Murty et al. (2023). To create generalization splits, we follow Murty et al. (2023) and start by sampling a training set with 200k training sentences using the distribution described by Hewitt et al. (2020) and then generate test sets with respect to this training set. ... We train character-level language models on the CodeSearchNet dataset (Husain et al., 2019), which is made up of functions in a variety of programming languages. (An illustrative Dyck sampling sketch follows the table.)
Dataset Splits | Yes | The training set contains 200k sentences and all the generalization sets contain 20k sentences. ... We draw the training examples from the original training split and the evaluation examples from the validation split.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used to run the experiments were provided.
Software Dependencies | No | The model is implemented in JAX (Bradbury et al., 2018) and adapted from the Haiku Transformer (Hennigan et al., 2020). The paper cites these frameworks but does not provide specific version numbers for them or for other key software dependencies used in the implementation or experiments. (An illustrative model sketch follows the table.)
Experiment Setup | Yes | Model and training details: We train two-layer Transformer language models on the Dyck-(20, 10) training data described in the previous section. The model uses learned absolute positional embeddings. Each layer has one attention head, one MLP, and layer normalization, and the model has a hidden dimension of 32. Details about the model and training procedure are in Appendix A.2 and A.3. ... We train the models to minimize the cross-entropy loss... We train the model for 100,000 steps with a batch size of 128 and use the final model for further analysis. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.999, ε = 1e-7, and a weight decay factor of 1e-4. We set the learning rate to follow a linear warmup for the first 10,000 steps followed by a square root decay, with a maximum learning rate of 5e-3. We do not use dropout. (An illustrative optimizer sketch follows the table.)
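
Illustrative sketches. The code below is not from the paper; it is a set of minimal Python sketches of the quoted descriptions, under stated assumptions, using the paper's JAX-based stack where relevant.

Faithfulness sketch. A minimal sketch of the comparison described in the Research Type row, assuming one of the simplification tools the paper names (projecting hidden states onto their top-k principal components) and an argmax-agreement notion of faithfulness. The logits_fn interface and all names are hypothetical.

    import numpy as np

    def fit_pca(train_hiddens, k):
        # Top-k principal directions of an [N, d] matrix of hidden states.
        mean = train_hiddens.mean(axis=0)
        _, _, vt = np.linalg.svd(train_hiddens - mean, full_matrices=False)
        return mean, vt[:k]                       # mean vector and [k, d] components

    def simplify(hiddens, mean, components):
        # Replace each hidden state with its reconstruction from the top-k subspace.
        return (hiddens - mean) @ components.T @ components + mean

    def faithfulness(logits_fn, hiddens, mean, components):
        # Fraction of positions where the simplified proxy's argmax prediction
        # matches the original model's prediction.
        original = logits_fn(hiddens).argmax(-1)
        proxy = logits_fn(simplify(hiddens, mean, components)).argmax(-1)
        return float((original == proxy).mean())

    # The proxy's generalization gap would then look like:
    #   faithfulness(logits_fn, in_dist_hiddens, mean, comps)
    #       - faithfulness(logits_fn, ood_hiddens, mean, comps)

Evaluating this quantity on in-distribution versus generalization data is what exposes the kind of faithfulness gap the abstract reports.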
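
Dyck sampling sketch. A minimal sketch of sampling balanced-bracket strings from Dyck-(20, 10), in the spirit of the generation procedure the paper adopts from Hewitt et al. (2020). The opening probability p_open and the length control are illustrative placeholders, not the paper's exact distribution.

    import random

    def sample_dyck(num_types=20, max_depth=10, p_open=0.5, max_len=100, rng=random):
        # Sample one balanced string over num_types bracket pairs, never nesting
        # deeper than max_depth; tokens look like "(3" and ")3".
        tokens, stack = [], []
        while len(tokens) < max_len or stack:
            can_open = len(stack) < max_depth and len(tokens) < max_len
            if can_open and (not stack or rng.random() < p_open):
                bracket = rng.randrange(num_types)
                stack.append(bracket)
                tokens.append(f"({bracket}")
            else:
                tokens.append(f"){stack.pop()}")
        return tokens

    # e.g. a 200k-sentence training set, matching the size quoted above
    train_set = [sample_dyck() for _ in range(200_000)]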
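
Model sketch. A minimal Haiku sketch consistent with the quoted architecture: two layers, one attention head and one MLP per layer, layer normalization, hidden dimension 32, and learned absolute positional embeddings. The vocabulary size, maximum length, MLP width, initializers, and pre-norm residual layout are assumptions, not the authors' exact code.

    import haiku as hk
    import jax
    import jax.numpy as jnp

    D_MODEL, NUM_LAYERS, NUM_HEADS = 32, 2, 1
    VOCAB_SIZE = 44   # assumption: 20 bracket pairs plus a few special tokens
    MAX_LEN = 512     # assumption: maximum sequence length

    def forward(tokens):                      # tokens: int array of shape [batch, seq_len]
        seq_len = tokens.shape[-1]
        x = hk.Embed(VOCAB_SIZE, D_MODEL)(tokens)
        pos = hk.get_parameter("pos_emb", [MAX_LEN, D_MODEL],
                               init=hk.initializers.TruncatedNormal(0.02))
        x = x + pos[:seq_len]                 # learned absolute positional embeddings
        mask = jnp.tril(jnp.ones((seq_len, seq_len), dtype=bool))[None, None]
        for _ in range(NUM_LAYERS):
            h = hk.LayerNorm(-1, True, True)(x)
            h = hk.MultiHeadAttention(NUM_HEADS, key_size=D_MODEL, model_size=D_MODEL,
                                      w_init=hk.initializers.VarianceScaling(1.0))(h, h, h, mask=mask)
            x = x + h                         # one attention head per layer
            h = hk.LayerNorm(-1, True, True)(x)
            h = hk.nets.MLP([4 * D_MODEL, D_MODEL])(h)
            x = x + h                         # one MLP per layer
        x = hk.LayerNorm(-1, True, True)(x)
        return hk.Linear(VOCAB_SIZE)(x)       # next-token logits

    model = hk.transform(forward)
    params = model.init(jax.random.PRNGKey(0), jnp.zeros((1, 16), dtype=jnp.int32))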
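
Optimizer sketch. A minimal optax sketch of the quoted training configuration: AdamW with β1 = 0.9, β2 = 0.999, ε = 1e-7, and weight decay 1e-4, with a linear warmup over the first 10,000 steps followed by square-root decay to a peak rate of 5e-3. The exact decay constant is an assumption; a standard inverse-square-root schedule is used here.

    import jax.numpy as jnp
    import optax

    PEAK_LR = 5e-3
    WARMUP_STEPS = 10_000

    def lr_schedule(step):
        # Linear warmup to the peak rate over the first 10k steps, then decay
        # proportional to 1/sqrt(step) (one natural reading of "square root decay").
        step = jnp.maximum(step, 1)
        warmup = step / WARMUP_STEPS
        decay = jnp.sqrt(WARMUP_STEPS / step)
        return PEAK_LR * jnp.minimum(warmup, decay)

    optimizer = optax.adamw(
        learning_rate=lr_schedule,   # optax accepts a schedule callable here
        b1=0.9,
        b2=0.999,
        eps=1e-7,
        weight_decay=1e-4,
    )
    # opt_state = optimizer.init(params), then optimizer.update(...) at each of the
    # 100,000 training steps with batch size 128, per the quoted setup.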