Transcoders find interpretable LLM feature circuits

Authors: Jacob Dunefsky, Philippe Chlenski, Neel Nanda

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the greater-than circuit in GPT2-small.
Researcher Affiliation | Academia | Jacob Dunefsky (Yale University, New Haven, CT 06511, jacob.dunefsky@yale.edu); Philippe Chlenski (Columbia University, New York, NY 10027, pac@cs.columbia.edu)
Pseudocode | Yes | Algorithm 1: Greedy computational-path-finding; Algorithm 2: Paths-to-graph
Open Source Code | Yes | Code is available at https://github.com/jacobdunefsky/transcoder_circuits/.
Open Datasets | Yes | We evaluated each SAE and transcoder on the same 3.2M tokens of Open Web Text data [21]. Dataset: Open Web Text (Hugging Face), license CC0-1.0 [21].
Dataset Splits | No | The paper mentions training on 60 million tokens and evaluating on 3.2 million tokens from the Open Web Text dataset, but does not specify explicit train/validation/test splits (e.g., percentages or counts for each split).
Hardware Specification | Yes | The SAEs and transcoders from Section 4.2 were trained on an internal cluster using an A100 GPU with 80 GB of VRAM.
Software Dependencies | No | Appendix A lists 'Transformer Lens' and 'SAELens' as assets used, but does not provide specific version numbers for these software components, nor for Python, PyTorch, or CUDA.
Experiment Setup | Yes | All SAEs and transcoders were trained with a learning rate of 2 × 10^-5 using the Adam optimizer. The batch size was 4096 examples per batch. The same random seed (42) was used to initialize all SAEs and transcoders during the training process.
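The Experiment Setup and Research Type rows above pin down the optimizer, learning rate, batch size, and random seed, but not the training loop itself. The following is a minimal PyTorch sketch of how a transcoder with those hyperparameters could be trained; it is not the authors' implementation (see their repository above), and the feature width `d_features`, the sparsity coefficient `l1_coeff`, and all names are hypothetical. It assumes the standard transcoder objective of reconstructing an MLP layer's output activations from its input activations under an L1 sparsity penalty.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # seed reported in the Experiment Setup row

class Transcoder(nn.Module):
    """Sparse, wide approximation of one MLP layer: maps the MLP's input
    activations to its output activations (unlike an SAE, which reconstructs
    the same activations it reads)."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, mlp_in: torch.Tensor):
        feats = torch.relu(self.encoder(mlp_in))  # sparse feature activations
        return self.decoder(feats), feats

d_model, d_features, l1_coeff = 768, 768 * 32, 1e-4  # hypothetical width and coefficient
tc = Transcoder(d_model, d_features)
opt = torch.optim.Adam(tc.parameters(), lr=2e-5)  # optimizer and learning rate from the paper

def training_step(mlp_in: torch.Tensor, mlp_out: torch.Tensor) -> float:
    """One step on a batch (the paper uses 4096 examples per batch) of
    (MLP input, MLP output) activation pairs."""
    recon, feats = tc(mlp_in)
    loss = (recon - mlp_out).pow(2).mean() + l1_coeff * feats.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```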
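The Pseudocode row above refers to Algorithm 1 (greedy computational-path-finding) and Algorithm 2 (paths-to-graph) in the paper. The sketch below is only a generic schematic of that pattern, not a transcription of the paper's pseudocode: `contributions` is a hypothetical callback returning upstream features with attribution scores, and `k` and `max_depth` are illustrative defaults.

```python
from collections import defaultdict

def greedy_paths(start_node, contributions, k=5, max_depth=3):
    """Schematic greedy path tracing: from a feature of interest, repeatedly
    keep the k upstream features with the largest contribution scores
    (one simple greedy variant). `contributions(node)` is a hypothetical
    callback returning [(upstream_node, score), ...] for the given node."""
    paths = [[start_node]]
    for _ in range(max_depth):
        new_paths = []
        for path in paths:
            upstream = sorted(contributions(path[-1]), key=lambda x: -x[1])[:k]
            if not upstream:
                new_paths.append(path)  # no further upstream contributors
                continue
            for node, _score in upstream:
                new_paths.append(path + [node])
        paths = new_paths
    return paths

def paths_to_graph(paths):
    """Merge a list of paths into one graph (adjacency map from upstream
    node to downstream nodes) by taking the union of their edges."""
    graph = defaultdict(set)
    for path in paths:
        for downstream, upstream in zip(path, path[1:]):
            graph[upstream].add(downstream)
    return dict(graph)
```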
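The Open Datasets and Software Dependencies rows mention Open Web Text on Hugging Face and the TransformerLens library. As a rough illustration of how such an evaluation stream could be assembled (the exact 3.2M-token subset and preprocessing used in the paper are not specified here), one could do something like the following; the dataset identifier and token budget are assumptions.

```python
from datasets import load_dataset
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT2-small, as studied in the paper
dataset = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

tokens_seen, token_budget = 0, 3_200_000  # 3.2M evaluation tokens, as reported
for example in dataset:
    tokens = model.to_tokens(example["text"])  # shape [1, n_tokens]
    tokens_seen += tokens.shape[-1]
    # ... run the SAE / transcoder evaluation on `tokens` here ...
    if tokens_seen >= token_budget:
        break
```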