Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Authors: Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE features. |
| Researcher Affiliation | Collaboration | Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey. Affiliations: Apollo Research; ML Alignment & Theory Scholars (MATS); University of Queensland. |
| Pseudocode | No | The paper describes the mathematical formulations of the SAE training losses (L_local, L_e2e, and L_e2e+downstream) in Section 2, but it does not include any explicitly labeled pseudocode or algorithm blocks (a hedged PyTorch sketch of these losses appears below the table). |
| Open Source Code | Yes | We release our library for training e2e SAEs and reproducing our analysis at https://github.com/ApolloResearch/e2e_sae. |
| Open Datasets | Yes | For our GPT2-small experiments, we train SAEs with each type of loss function on 400k samples of context size 1024 from the Open Web Text dataset [Gokaslan and Cohen, 2019] over a range of sparsity coefficients λ. We stream the dataset https://huggingface.co/datasets/apollo-research/Skylion007-openwebtext-tokenizer-gpt2, which is a tokenized version of Open Web Text [Gokaslan and Cohen, 2019] (released under the license CC0-1.0). A streaming sketch follows the table. |
| Dataset Splits | No | The paper describes the training data and evaluates models on 500 samples of the Open Web Text dataset, but it does not explicitly specify a distinct validation split, by size or percentage, separate from a test set for purposes such as hyperparameter tuning. |
| Hardware Specification | Yes | We used NVIDIA A100 GPUs with 80GB VRAM (although the GPU was saturated when using smaller batch sizes that used 40GB VRAM or less). |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer', the 'TransformerLens library', and 'Hugging Face's Transformers library', but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We train for 400k samples of context size 1024 on Open Web Text with an effective batch size of 16. We use a learning rate of 5e-4, with a warmup of 20k samples, a cosine schedule decaying to 10% of the max learning rate, and the Adam optimizer [Kingma and Ba, 2017] with default hyperparameters. A sketch of this optimizer and schedule follows the table. |
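To make the loss descriptions in the Pseudocode row concrete, here is a minimal PyTorch sketch of the three training objectives (L_local, L_e2e, L_e2e+downstream) as the paper describes them at a high level. It is an illustrative paraphrase under stated assumptions, not the released implementation: `lam` stands in for the sparsity coefficient λ, `beta` for a downstream-reconstruction weight, and all tensor names are hypothetical.

```python
import torch
import torch.nn.functional as F


def sae_losses(acts, recon_acts, coeffs, orig_logits, new_logits,
               orig_downstream, new_downstream, lam, beta):
    """Illustrative sketch of L_local, L_e2e, and L_e2e+downstream (not the paper's exact notation)."""
    # L1 sparsity penalty on the SAE's dictionary coefficients
    sparsity = coeffs.abs().sum(dim=-1).mean()

    # L_local: mean-squared reconstruction error at the SAE's own layer
    loss_local = F.mse_loss(recon_acts, acts) + lam * sparsity

    # L_e2e: KL divergence between the original output distribution and the
    # distribution produced when the SAE reconstruction is patched back in
    kl = F.kl_div(new_logits.log_softmax(dim=-1),
                  orig_logits.log_softmax(dim=-1),
                  log_target=True, reduction="batchmean")
    loss_e2e = kl + lam * sparsity

    # L_e2e+downstream: additionally match activations at later layers
    downstream_mse = sum(F.mse_loss(new, orig)
                         for new, orig in zip(new_downstream, orig_downstream))
    loss_e2e_downstream = kl + lam * sparsity + beta * downstream_mse

    return loss_local, loss_e2e, loss_e2e_downstream
```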
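The Open Datasets row points to a pre-tokenized Open Web Text dataset on the Hugging Face Hub that the authors stream during training. A minimal sketch of streaming it with the `datasets` library; the `train` split name and `input_ids` column are assumptions, not taken from the paper.

```python
from datasets import load_dataset

# Stream the tokenized Open Web Text dataset rather than downloading it in full.
dataset = load_dataset(
    "apollo-research/Skylion007-openwebtext-tokenizer-gpt2",
    split="train",        # assumed split name
    streaming=True,
)

for sample in dataset.take(2):
    # "input_ids" is an assumed column name; context size is 1024 in the paper.
    print(len(sample["input_ids"]))
```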
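The Experiment Setup row specifies Adam at a learning rate of 5e-4, a 20k-sample warmup, and a cosine schedule decaying to 10% of the maximum learning rate. Below is a sketch of that schedule using `torch.optim.lr_scheduler.LambdaLR`, assuming one optimizer step per effective batch of 16 samples; the `model` here is a stand-in module, not the released SAE.

```python
import math
import torch

# Numbers from the setup above: 400k samples, 20k-sample warmup, effective batch size 16.
TOTAL_SAMPLES, WARMUP_SAMPLES, BATCH_SIZE = 400_000, 20_000, 16
total_steps = TOTAL_SAMPLES // BATCH_SIZE
warmup_steps = WARMUP_SAMPLES // BATCH_SIZE

model = torch.nn.Linear(768, 768 * 60)  # hypothetical stand-in, not the actual SAE
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # default Adam hyperparameters


def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear warmup to the max learning rate
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # cosine decay from 100% down to 10% of the max learning rate
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop: optimizer.step() followed by scheduler.step() each batch.
```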