Transcoders find interpretable LLM feature circuits
Authors: Jacob Dunefsky, Philippe Chlenski, Neel Nanda
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the greater-than circuit in GPT2-small. (A minimal transcoder sketch follows the table.) |
| Researcher Affiliation | Academia | Jacob Dunefsky, Yale University, New Haven, CT 06511, jacob.dunefsky@yale.edu; Philippe Chlenski, Columbia University, New York, NY 10027, pac@cs.columbia.edu |
| Pseudocode | Yes | Algorithm 1: Greedy computational-path-finding; Algorithm 2: Paths-to-graph |
| Open Source Code | Yes | Code is available at https://github.com/jacobdunefsky/transcoder_circuits/. |
| Open Datasets | Yes | We evaluated each SAE and transcoder on the same 3.2M tokens of Open Web Text data [21]. Asset listing: Open Web Text (Hugging Face), CC0-1.0 license [21]. |
| Dataset Splits | No | The paper mentions training on 60 million tokens and evaluating on 3.2 million tokens from the Open Web Text dataset, but does not specify explicit train/validation/test dataset splits (e.g., percentages or counts for each split). |
| Hardware Specification | Yes | The SAEs and transcoders from Section 4.2 were trained on an internal cluster using an A100 GPU with 80 GB of VRAM. |
| Software Dependencies | No | Appendix A lists 'Transformer Lens' and 'SAELens' as assets used, but does not provide specific version numbers for these software components, nor for Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | All SAEs and transcoders were trained with a learning rate of 2 × 10⁻⁵ using the Adam optimizer. The batch size was 4096 examples. The same random seed (42) was used to initialize all SAEs and transcoders during training. (A training-configuration sketch follows the table.) |
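
The Research Type row above summarizes the paper's core object: a transcoder is a wide, sparsely activating MLP trained to approximate an original MLP sublayer's output from that sublayer's input. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; the layer widths and the L1 sparsity coefficient are illustrative assumptions.

```python
# Minimal transcoder sketch: encode the MLP sublayer's input into sparse
# features, then linearly decode an approximation of the MLP's output.
import torch
import torch.nn as nn


class Transcoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # W_enc, b_enc
        self.decoder = nn.Linear(d_hidden, d_model)  # W_dec, b_dec

    def forward(self, mlp_input: torch.Tensor):
        features = torch.relu(self.encoder(mlp_input))  # sparse feature activations
        return self.decoder(features), features


def transcoder_loss(pred, target, features, l1_coeff=1e-3):
    # Faithfulness (MSE to the true MLP output) plus an L1 sparsity penalty
    # on the feature activations; the coefficient here is an assumed placeholder.
    return nn.functional.mse_loss(pred, target) + l1_coeff * features.abs().sum(dim=-1).mean()
```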
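
The Experiment Setup row reports Adam with a learning rate of 2 × 10⁻⁵, a batch size of 4096, and a fixed seed of 42. Below is a minimal sketch of that configuration, reusing the `Transcoder` and `transcoder_loss` from the sketch above; the model widths and the `activation_batches` iterable are hypothetical placeholders, not part of the released code.

```python
# Sketch of the reported training configuration; only the optimizer choice,
# learning rate, batch size, and seed come from the table above.
import torch

torch.manual_seed(42)  # same seed reported for all SAEs and transcoders

model = Transcoder(d_model=768, d_hidden=768 * 32)         # widths are assumptions
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # reported learning rate

# activation_batches is assumed to yield (mlp_input, mlp_output) tensors of
# shape (4096, d_model), harvested from the base LLM's MLP sublayer.
for mlp_input, mlp_output in activation_batches:
    pred, features = model(mlp_input)
    loss = transcoder_loss(pred, mlp_output, features)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```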