Compact Proofs of Model Performance via Mechanistic Interpretability

Authors: Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prototype this approach by formally proving accuracy lower bounds for a small transformer trained on Max-of-K, validating proof transferability across 151 random seeds and four values of K. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding.
Researcher Affiliation | Academia | Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan. Corresponding author: please direct correspondence to jgross@mit.edu.
Pseudocode | Yes | Algorithm 1: Counting Correct Sequences by Brute Force (a Python sketch of this brute-force count follows the table).
Open Source Code | Yes | Code: https://github.com/JasonGross/guarantees-based-mechanistic-interpretability/
Open Datasets | No | To train each model, we generate 384,000 random sequences of 4 integers picked uniformly at random, corresponding to less than 2.5% of the input distribution. (A data-generation sketch follows the table.)
Dataset Splits | No | The paper describes generating training sequences but does not specify explicit train/validation/test dataset splits. It reports "train accuracy" but no separate validation metrics or splits.
Hardware Specification | No | As our models are sufficiently small, we did not have to use any GPUs to accelerate training or inference. Each training run takes less than a single CPU-hour to complete. In total, the experiments in this paper took less than 1000 CPU-hours.
Software Dependencies | Yes | We use the following software packages in our work: Paszke et al. [41], Plotly Technologies Inc. [42], Nanda and Bloom [37], Rogozhnikov [47], Virtanen et al. [52], McKinney [33], Waskom [55], specifically mentioning "SciPy 1.0" from [52]. (An import sketch mapping these citations to packages follows the table.)
Experiment Setup | Yes | We set hidden dimension d_model = 32 and a vocabulary of size d_vocab = 64 comprising integers between 0 and 63 inclusive. ... We use AdamW with batch_size = 128, lr = 0.001, betas = (0.9, 0.999), and weight_decay left at the default 0.01. We train for 1 epoch (3000 steps). (A training-setup sketch follows the table.)
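
The Open Datasets row quotes the paper's training-data generation: 384,000 uniformly random length-4 sequences, under 2.5% of the 64^4 = 16,777,216 possible inputs. A minimal sketch of that sampling; labeling each sequence with its maximum is our assumption about how the Max-of-K targets are formed.

```python
import torch

def generate_max_of_k_data(n_samples=384_000, k=4, d_vocab=64, seed=0):
    """Sample length-k sequences uniformly at random; label each with its maximum."""
    gen = torch.Generator().manual_seed(seed)
    tokens = torch.randint(0, d_vocab, (n_samples, k), generator=gen)  # (n_samples, k)
    labels = tokens.max(dim=-1).values                                 # target: max of each sequence
    return tokens, labels

tokens, labels = generate_max_of_k_data()
# 384,000 / 64**4 is roughly 2.3% of the 16,777,216 possible inputs,
# consistent with "less than 2.5% of the input distribution".
```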
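The Experiment Setup row quotes the hyperparameters d_model = 32, d_vocab = 64, AdamW with batch_size = 128, lr = 0.001, betas = (0.9, 0.999), weight_decay = 0.01, trained for 1 epoch (3000 steps, i.e. 384,000 / 128). Below is a hedged sketch of how such a run could be wired up with TransformerLens (the Nanda and Bloom package cited above); the one-layer attention-only architecture, head dimensions, normalization choice, and the cross-entropy loss on the final position are illustrative assumptions, not quoted from the paper.

```python
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer, HookedTransformerConfig

# Quoted hyperparameters: d_model = 32, d_vocab = 64, AdamW(lr=0.001, betas=(0.9, 0.999),
# weight_decay=0.01), batch_size = 128, 1 epoch = 3000 steps. The fields n_layers, n_heads,
# d_head, attn_only, and normalization_type are illustrative assumptions.
cfg = HookedTransformerConfig(
    n_layers=1, n_heads=1, d_head=32,
    d_model=32, d_vocab=64, n_ctx=4,
    attn_only=True, normalization_type=None, seed=0,
)
model = HookedTransformer(cfg)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=0.01
)

# Uniformly random training data, labeled by the sequence maximum (as in the sketch above).
tokens = torch.randint(0, 64, (384_000, 4))
labels = tokens.max(dim=-1).values

batch_size = 128
for step in range(3000):  # one epoch over 384,000 sequences
    batch = slice(step * batch_size, (step + 1) * batch_size)
    logits = model(tokens[batch])                            # (128, 4, 64)
    loss = F.cross_entropy(logits[:, -1, :], labels[batch])  # predict the max at the last position
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```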
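The Pseudocode row refers to Algorithm 1, which counts the sequences a model classifies correctly by exhaustively enumerating the input space. A minimal Python sketch of such a brute-force count, assuming a model that maps a batch of length-K token sequences to per-position logits (as a TransformerLens-style model does); the chunked enumeration and the returned accuracy fraction are our own choices, not taken from the paper.

```python
import itertools
import torch

@torch.no_grad()
def count_correct_by_brute_force(model, d_vocab=64, k=4, batch_size=4096):
    """Enumerate all d_vocab**k length-k sequences and count those on which
    the model's last-position prediction equals the true maximum."""
    correct = 0
    total = d_vocab ** k  # 64**4 = 16,777,216 sequences in the paper's setting
    seqs = itertools.product(range(d_vocab), repeat=k)
    while True:
        chunk = list(itertools.islice(seqs, batch_size))
        if not chunk:
            break
        batch = torch.tensor(chunk)                 # (batch, k)
        logits = model(batch)                       # (batch, k, d_vocab), assumed output shape
        preds = logits[:, -1, :].argmax(dim=-1)     # read the prediction off the final position
        correct += (preds == batch.max(dim=-1).values).sum().item()
    return correct, correct / total
```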
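The Software Dependencies row cites packages by author and reference number rather than by name. The import sketch below gives our reading of which installable Python packages those citations correspond to; the mapping is an assumption, not stated explicitly in the row.

```python
# Assumed mapping from the cited references to importable packages.
import torch             # Paszke et al. [41]: PyTorch
import plotly            # Plotly Technologies Inc. [42]
import transformer_lens  # Nanda and Bloom [37]: TransformerLens
import einops            # Rogozhnikov [47]
import scipy             # Virtanen et al. [52]: "SciPy 1.0"
import pandas            # McKinney [33]
import seaborn           # Waskom [55]
```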