Compact Proofs of Model Performance via Mechanistic Interpretability

Authors: Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prototype this approach by formally proving accuracy lower bounds for a small transformer trained on Max-of-K, validating proof transferability across 151 random seeds and four values of K. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding.
Researcher Affiliation | Academia | Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan. Corresponding author: please direct correspondence to jgross@mit.edu.
Pseudocode | Yes | Algorithm 1: Counting Correct Sequences by Brute Force (a Python sketch of this brute-force count follows the table).
Open Source Code | Yes | Code: https://github.com/JasonGross/guarantees-based-mechanistic-interpretability/
Open Datasets | No | To train each model, we generate 384,000 random sequences of 4 integers picked uniformly at random, corresponding to less than 2.5% of the input distribution. (A data-generation sketch follows the table.)
Dataset Splits | No | The paper describes generating training sequences but does not specify explicit train/validation/test dataset splits. It reports "train accuracy" but no separate validation metrics or splits.
Hardware Specification | No | As our models are sufficiently small, we did not have to use any GPUs to accelerate training or inference. Each training run takes less than a single CPU-hour to complete. In total, the experiments in this paper took less than 1000 CPU-hours.
Software Dependencies | Yes | We use the following software packages in our work: Paszke et al. [41], Plotly Technologies Inc. [42], Nanda and Bloom [37], Rogozhnikov [47], Virtanen et al. [52], McKinney [33], Waskom [55], specifically mentioning "SciPy 1.0" from [52]. (An import sketch mapping these citations to packages follows the table.)
Experiment Setup | Yes | We set hidden dimension d_model = 32 and a vocabulary of size d_vocab = 64 comprising integers between 0 and 63 inclusive. ... We use AdamW with batch_size = 128, lr = 0.001, betas = (0.9, 0.999), and weight_decay left at the default 0.01. We train for 1 epoch (3000 steps). (A training-setup sketch follows the table.)
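
The Open Datasets row quotes the paper's training-data generation: 384,000 uniformly random length-4 sequences, under 2.5% of the 64^4 = 16,777,216 possible inputs. A minimal sketch of that sampling; labeling each sequence with its maximum is our assumption about how the Max-of-K targets are formed.

```python
import torch

def generate_max_of_k_data(n_samples=384_000, k=4, d_vocab=64, seed=0):
    """Sample length-k sequences uniformly at random; label each with its maximum."""
    gen = torch.Generator().manual_seed(seed)
    tokens = torch.randint(0, d_vocab, (n_samples, k), generator=gen)  # (n_samples, k)
    labels = tokens.max(dim=-1).values                                 # target: max of each sequence
    return tokens, labels

tokens, labels = generate_max_of_k_data()
# 384,000 / 64**4 is roughly 2.3% of the 16,777,216 possible inputs,
# consistent with "less than 2.5% of the input distribution".
```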
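The Experiment Setup row quotes the hyperparameters d_model = 32, d_vocab = 64, AdamW with batch_size = 128, lr = 0.001, betas = (0.9, 0.999), weight_decay = 0.01, trained for 1 epoch (3000 steps, i.e. 384,000 / 128). Below is a hedged sketch of how such a run could be wired up with TransformerLens (the Nanda and Bloom package cited above); the one-layer attention-only architecture, head dimensions, normalization choice, and the cross-entropy loss on the final position are illustrative assumptions, not quoted from the paper.

```python
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer, HookedTransformerConfig

# Quoted hyperparameters: d_model = 32, d_vocab = 64, AdamW(lr=0.001, betas=(0.9, 0.999),
# weight_decay=0.01), batch_size = 128, 1 epoch = 3000 steps. The fields n_layers, n_heads,
# d_head, attn_only, and normalization_type are illustrative assumptions.
cfg = HookedTransformerConfig(
    n_layers=1, n_heads=1, d_head=32,
    d_model=32, d_vocab=64, n_ctx=4,
    attn_only=True, normalization_type=None, seed=0,
)
model = HookedTransformer(cfg)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=0.01
)

# Uniformly random training data, labeled by the sequence maximum (as in the sketch above).
tokens = torch.randint(0, 64, (384_000, 4))
labels = tokens.max(dim=-1).values

batch_size = 128
for step in range(3000):  # one epoch over 384,000 sequences
    batch = slice(step * batch_size, (step + 1) * batch_size)
    logits = model(tokens[batch])                            # (128, 4, 64)
    loss = F.cross_entropy(logits[:, -1, :], labels[batch])  # predict the max at the last position
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```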
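The Pseudocode row refers to Algorithm 1, which counts the sequences a model classifies correctly by exhaustively enumerating the input space. A minimal Python sketch of such a brute-force count, assuming a model that maps a batch of length-K token sequences to per-position logits (as a TransformerLens-style model does); the chunked enumeration and the returned accuracy fraction are our own choices, not taken from the paper.

```python
import itertools
import torch

@torch.no_grad()
def count_correct_by_brute_force(model, d_vocab=64, k=4, batch_size=4096):
    """Enumerate all d_vocab**k length-k sequences and count those on which
    the model's last-position prediction equals the true maximum."""
    correct = 0
    total = d_vocab ** k  # 64**4 = 16,777,216 sequences in the paper's setting
    seqs = itertools.product(range(d_vocab), repeat=k)
    while True:
        chunk = list(itertools.islice(seqs, batch_size))
        if not chunk:
            break
        batch = torch.tensor(chunk)                 # (batch, k)
        logits = model(batch)                       # (batch, k, d_vocab), assumed output shape
        preds = logits[:, -1, :].argmax(dim=-1)     # read the prediction off the final position
        correct += (preds == batch.max(dim=-1).values).sum().item()
    return correct, correct / total
```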
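The Software Dependencies row cites packages by author and reference number rather than by name. The import sketch below gives our reading of which installable Python packages those citations correspond to; the mapping is an assumption, not stated explicitly in the row.

```python
# Assumed mapping from the cited references to importable packages.
import torch             # Paszke et al. [41]: PyTorch
import plotly            # Plotly Technologies Inc. [42]
import transformer_lens  # Nanda and Bloom [37]: TransformerLens
import einops            # Rogozhnikov [47]
import scipy             # Virtanen et al. [52]: "SciPy 1.0"
import pandas            # McKinney [33]
import seaborn           # Waskom [55]
```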