Optimal ablation for interpretability

Authors: Maximilian Li, Lucas Janson

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We then show that using OA produces meaningful improvements for several common downstream applications of measuring component importance. In section 3, we apply OA to algorithmic circuit discovery (Conmy et al., 2023)... In section 4, we use OA to locate relevant components for factual recall (Meng et al., 2022)... In section 5, we apply OA to latent prediction (Belrose et al., 2023a)..." |
| Researcher Affiliation | Academia | "Maximilian Li, Harvard University; Lucas Janson, Harvard University" |
| Pseudocode | Yes | "Algorithm 1: Uniform gradient sampling" |
| Open Source Code | Yes | "All code can be found at https://github.com/maxtli/optimalablation." |
| Open Datasets | Yes | "The Indirect Object Identification (IOI) subtask (Wang et al., 2022)..." |
| Dataset Splits | No | "We train OAT on 60% of the dataset and evaluate both methods on the other 40%." |
| Hardware Specification | Yes | "All experiments were run on a single Nvidia A100 GPU with 80GB VRAM." |
| Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., Python 3.x, PyTorch 1.y, CUDA z.w). |
| Experiment Setup | Yes | "We use learning rates between 0.01 and 0.15 for the sampling parameters. We use a learning rate of 0.002 for a..." |