Compact Proofs of Model Performance via Mechanistic Interpretability
Authors: Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prototype this approach by formally proving accuracy lower bounds for a small transformer trained on Max-of-K, validating proof transferability across 151 random seeds and four values of K. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. |
| Researcher Affiliation | Academia | Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan. Corresponding author. Please direct correspondence to jgross@mit.edu. |
| Pseudocode | Yes | Algorithm 1: Counting Correct Sequences By Brute Force (a hedged sketch of this count follows the table) |
| Open Source Code | Yes | Code: https://github.com/JasonGross/guarantees-based-mechanistic-interpretability/ |
| Open Datasets | No | To train each model, we generate 384,000 random sequences of 4 integers picked uniformly at random, corresponding to less than 2.5% of the input distribution. |
| Dataset Splits | No | The paper describes generating training sequences but does not specify explicit train/validation/test dataset splits. It reports 'train accuracy' but no separate validation metrics or splits. |
| Hardware Specification | No | As our models are sufficiently small, we did not have to use any GPUs to accelerate training or inference. Each training run takes less than a single CPU-hour to complete. In total, the experiments in this paper took less than 1000 CPU-hours. |
| Software Dependencies | Yes | We use the following software packages in our work: Paszke et al. [41], Plotly Technologies Inc. [42], Nanda and Bloom [37], Rogozhnikov [47], Virtanen et al. [52], McKinney [33], Waskom [55], specifically mentioning "SciPy 1.0" from [52]. |
| Experiment Setup | Yes | We set hidden dimension d_model = 32 and a vocabulary of size d_vocab = 64 comprising integers between 0 and 63 inclusive. ... We use AdamW with batch_size = 128, lr = 0.001, betas = (0.9, 0.999), weight_decay left at the default 0.01. We train for 1 epoch (3000 steps). (A hedged reconstruction of this setup follows the table.) |
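
The training details quoted above can be assembled into a short script. The sketch below is a hedged reconstruction from those quotes, not the authors' code: the one-layer, single-head, attention-only architecture built with TransformerLens (Nanda and Bloom [37]) is an assumption here, as is reading the predicted maximum off the final sequence position.

```python
# Hedged reconstruction of the quoted Max-of-4 training setup; NOT the authors' code.
# Assumptions: one-layer, single-head, attention-only TransformerLens model,
# prediction of the maximum taken at the final sequence position.
import torch
from transformer_lens import HookedTransformer, HookedTransformerConfig

D_VOCAB, SEQ_LEN, D_MODEL = 64, 4, 32
N_SEQS, BATCH_SIZE = 384_000, 128            # 384_000 / 128 = 3_000 steps = 1 epoch

cfg = HookedTransformerConfig(
    n_layers=1, n_heads=1, d_model=D_MODEL, d_head=D_MODEL,
    n_ctx=SEQ_LEN, d_vocab=D_VOCAB, attn_only=True, normalization_type=None,
)
model = HookedTransformer(cfg)

# 384,000 uniformly random length-4 sequences over {0, ..., 63}:
# less than 2.5% of the 64**4 = 16,777,216 possible inputs.
train_tokens = torch.randint(0, D_VOCAB, (N_SEQS, SEQ_LEN))
train_labels = train_tokens.max(dim=-1).values

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), weight_decay=0.01)

for step in range(N_SEQS // BATCH_SIZE):      # one epoch, 3000 steps
    batch = slice(step * BATCH_SIZE, (step + 1) * BATCH_SIZE)
    logits = model(train_tokens[batch])[:, -1, :]            # [B, d_vocab]
    loss = torch.nn.functional.cross_entropy(logits, train_labels[batch])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that 384,000 sequences at batch size 128 gives exactly the quoted 3,000 steps, and 384,000 / 64^4 ≈ 2.3%, consistent with the "less than 2.5% of the input distribution" figure.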
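
The pseudocode row refers to "Algorithm 1: Counting Correct Sequences By Brute Force", which counts the inputs on which the model returns the correct maximum by enumerating the full input space; with d_vocab = 64 and K = 4 that is 64^4 = 16,777,216 sequences, which is tractable on a CPU. The following is a plausible rendering of that count, assuming the same model interface as in the sketch above; it is not the paper's listing.

```python
# Hedged sketch of a brute-force correct-sequence count (cf. Algorithm 1).
# Assumption (not from the paper's listing): model(tokens) returns logits of
# shape [batch, seq_len, d_vocab] and the prediction is the argmax at the
# final position.
import itertools
import torch

def count_correct_by_brute_force(model, d_vocab=64, seq_len=4, batch_size=4096):
    """Enumerate all d_vocab**seq_len sequences; count those where the model's
    final-position argmax equals the true maximum of the sequence."""
    correct, buffer = 0, []
    for seq in itertools.product(range(d_vocab), repeat=seq_len):
        buffer.append(seq)
        if len(buffer) == batch_size:
            correct += _count_batch(model, buffer)
            buffer = []
    if buffer:
        correct += _count_batch(model, buffer)
    return correct          # exact accuracy = correct / d_vocab**seq_len

def _count_batch(model, buffer):
    tokens = torch.tensor(buffer)                        # [B, seq_len]
    with torch.no_grad():
        preds = model(tokens)[:, -1, :].argmax(dim=-1)   # [B]
    return (preds == tokens.max(dim=-1).values).sum().item()
```

The batch size only trades memory for speed; the returned count, and hence the exact accuracy correct / 64^4, does not depend on it.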