Transcoders find interpretable LLM feature circuits

Authors: Jacob Dunefsky, Philippe Chlenski, Neel Nanda

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the greater-than circuit in GPT2-small.
Researcher Affiliation | Academia | Jacob Dunefsky (Yale University, New Haven, CT 06511, jacob.dunefsky@yale.edu); Philippe Chlenski (Columbia University, New York, NY 10027, pac@cs.columbia.edu)
Pseudocode | Yes | Algorithm 1: Greedy computational-path-finding; Algorithm 2: Paths-to-graph
Open Source Code | Yes | Code is available at https://github.com/jacobdunefsky/transcoder_circuits/.
Open Datasets | Yes | We evaluated each SAE and transcoder on the same 3.2M tokens of Open Web Text data [21]. Dataset: Open Web Text (Hugging Face), license CC0-1.0 [21].
Dataset Splits | No | The paper mentions training on 60 million tokens and evaluating on 3.2 million tokens from the Open Web Text dataset, but does not specify explicit train/validation/test splits (e.g., percentages or counts for each split).
Hardware Specification | Yes | The SAEs and transcoders from Section 4.2 were trained on an internal cluster using an A100 GPU with 80 GB of VRAM.
Software Dependencies | No | Appendix A lists 'Transformer Lens' and 'SAELens' as assets used, but does not provide specific version numbers for these software components, nor for Python, PyTorch, or CUDA.
Experiment Setup | Yes | All SAEs and transcoders were trained with a learning rate of 2 × 10^-5 using the Adam optimizer. The batch size was 4096 examples per batch. The same random seed (42) was used to initialize all SAEs and transcoders during the training process.
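The Experiment Setup and Research Type rows above pin down the optimizer, learning rate, batch size, and random seed, but not the training loop itself. The following is a minimal PyTorch sketch of how a transcoder with those hyperparameters could be trained; it is not the authors' implementation (see their repository above), and the feature width `d_features`, the sparsity coefficient `l1_coeff`, and all names are hypothetical. It assumes the standard transcoder objective of reconstructing an MLP layer's output activations from its input activations under an L1 sparsity penalty.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # seed reported in the Experiment Setup row

class Transcoder(nn.Module):
    """Sparse, wide approximation of one MLP layer: maps the MLP's input
    activations to its output activations (unlike an SAE, which reconstructs
    the same activations it reads)."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, mlp_in: torch.Tensor):
        feats = torch.relu(self.encoder(mlp_in))  # sparse feature activations
        return self.decoder(feats), feats

d_model, d_features, l1_coeff = 768, 768 * 32, 1e-4  # hypothetical width and coefficient
tc = Transcoder(d_model, d_features)
opt = torch.optim.Adam(tc.parameters(), lr=2e-5)  # optimizer and learning rate from the paper

def training_step(mlp_in: torch.Tensor, mlp_out: torch.Tensor) -> float:
    """One step on a batch (the paper uses 4096 examples per batch) of
    (MLP input, MLP output) activation pairs."""
    recon, feats = tc(mlp_in)
    loss = (recon - mlp_out).pow(2).mean() + l1_coeff * feats.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```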
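The Pseudocode row above refers to Algorithm 1 (greedy computational-path-finding) and Algorithm 2 (paths-to-graph) in the paper. The sketch below is only a generic schematic of that pattern, not a transcription of the paper's pseudocode: `contributions` is a hypothetical callback returning upstream features with attribution scores, and `k` and `max_depth` are illustrative defaults.

```python
from collections import defaultdict

def greedy_paths(start_node, contributions, k=5, max_depth=3):
    """Schematic greedy path tracing: from a feature of interest, repeatedly
    keep the k upstream features with the largest contribution scores
    (one simple greedy variant). `contributions(node)` is a hypothetical
    callback returning [(upstream_node, score), ...] for the given node."""
    paths = [[start_node]]
    for _ in range(max_depth):
        new_paths = []
        for path in paths:
            upstream = sorted(contributions(path[-1]), key=lambda x: -x[1])[:k]
            if not upstream:
                new_paths.append(path)  # no further upstream contributors
                continue
            for node, _score in upstream:
                new_paths.append(path + [node])
        paths = new_paths
    return paths

def paths_to_graph(paths):
    """Merge a list of paths into one graph (adjacency map from upstream
    node to downstream nodes) by taking the union of their edges."""
    graph = defaultdict(set)
    for path in paths:
        for downstream, upstream in zip(path, path[1:]):
            graph[upstream].add(downstream)
    return dict(graph)
```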
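The Open Datasets and Software Dependencies rows mention Open Web Text on Hugging Face and the TransformerLens library. As a rough illustration of how such an evaluation stream could be assembled (the exact 3.2M-token subset and preprocessing used in the paper are not specified here), one could do something like the following; the dataset identifier and token budget are assumptions.

```python
from datasets import load_dataset
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT2-small, as studied in the paper
dataset = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

tokens_seen, token_budget = 0, 3_200_000  # 3.2M evaluation tokens, as reported
for example in dataset:
    tokens = model.to_tokens(example["text"])  # shape [1, n_tokens]
    tokens_seen += tokens.shape[-1]
    # ... run the SAE / transcoder evaluation on `tokens` here ...
    if tokens_seen >= token_budget:
        break
```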