Progress measures for grokking via mechanistic interpretability

Authors: Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the recently-discovered phenomenon of grokking exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.
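
A minimal NumPy sketch of the reverse-engineered algorithm, assuming a small set of key frequencies (the specific frequencies below are hypothetical; the learned set varies between training runs). Each logit is a sum of cosines cos(w_k(a + b - c)); the angle-addition identities decompose these into the sine/cosine features the paper finds in the network, and the sum peaks exactly at c = (a + b) mod P:

    import numpy as np

    P = 113
    # Hypothetical key frequencies; the set a trained model learns varies by seed.
    key_freqs = [14, 35, 41, 52]

    def fourier_logits(a, b):
        """Logit for each candidate answer c, built from cosines at key frequencies.

        cos(w*(a+b-c)) expands via trigonometric identities into products of
        cos(w*a), sin(w*a), cos(w*b), sin(w*b), cos(w*c), sin(w*c) -- the
        quantities the paper identifies in the embeddings, MLP, and unembedding.
        """
        c = np.arange(P)
        w = 2 * np.pi * np.array(key_freqs)[:, None] / P
        return np.cos(w * (a + b - c)).sum(axis=0)

    a, b = 27, 98
    # The rotation about the circle lands on (a + b) mod P: since P is prime,
    # every cosine attains 1 only at c = (a + b) mod P, so the argmax is unique.
    assert fourier_logits(a, b).argmax() == (a + b) % P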
Researcher Affiliation | Collaboration | Authors: Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Listed affiliations: independent researcher; University of California, Berkeley.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | An annotated Colab notebook containing the code to replicate our results, including download instructions for model checkpoints, is available at https://bit.ly/grokking-progress-measures-website.
Open Datasets | No | Our mainline dataset consists of 30% of the entire set of possible inputs (that is, 30% of the 113 × 113 pairs of numbers mod P). The paper describes how the dataset is generated but does not provide a direct link, DOI, or formal citation for public access.
Dataset Splits | No | The paper states: 'Our mainline dataset consists of 30% of the entire set of possible inputs... We evaluate test loss and accuracy on all pairs of inputs not used for training.' This specifies training and test splits, but no explicit validation split is mentioned.
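
A minimal sketch of how the split described in the two entries above could be generated (the random seed and shuffling scheme are assumptions; the paper does not specify them):

    import numpy as np

    P, frac_train = 113, 0.3
    # All P*P input pairs (a, b) with label (a + b) mod P.
    pairs = np.array([(a, b) for a in range(P) for b in range(P)])
    labels = pairs.sum(axis=1) % P

    rng = np.random.default_rng(0)  # seed is an assumption, not from the paper
    perm = rng.permutation(len(pairs))
    n_train = int(frac_train * len(pairs))
    # Test set = all pairs not used for training, per the quoted description.
    train_idx, test_idx = perm[:n_train], perm[n_train:]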
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | We trained our models using PyTorch (Paszke et al., 2019) and performed our data analysis using NumPy (Harris et al., 2020), Pandas (McKinney, 2010), and einops (Rogozhnikov, 2022). Our figures were made using Plotly (Plotly Technologies Inc., 2015). The paper lists these packages and cites their original papers, but does not specify version numbers.
Experiment Setup | Yes | In our mainline experiment, we take P = 113 and use a one-layer ReLU transformer, token embeddings with d = 128, learned positional embeddings, 4 attention heads of dimension d/4 = 32, and n = 512 hidden units in the MLP. We did not use LayerNorm or tie our embed/unembed matrices. Our mainline dataset consists of 30% of the entire set of possible inputs (that is, 30% of the 113 × 113 pairs of numbers mod P). We use full-batch gradient descent with the AdamW optimizer (Loshchilov & Hutter, 2017), learning rate γ = 0.001, and weight decay parameter λ = 1. We perform 40,000 epochs of training.
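
A self-contained PyTorch sketch of this configuration; the tokenization of the input sequence and the exact residual wiring are assumptions about details the quoted setup leaves open, not the authors' implementation:

    import torch
    import torch.nn as nn

    P, d_model, n_heads, d_mlp, n_ctx = 113, 128, 4, 512, 3

    class OneLayerTransformer(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(P + 1, d_model)             # tokens 0..P-1 plus an '=' token (assumed)
            self.pos = nn.Parameter(torch.randn(n_ctx, d_model))  # learned positional embeddings
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # 4 heads of dim 32
            self.mlp = nn.Sequential(nn.Linear(d_model, d_mlp), nn.ReLU(),
                                     nn.Linear(d_mlp, d_model))   # n = 512 hidden units
            self.unembed = nn.Linear(d_model, P, bias=False)      # untied from embed; no LayerNorm anywhere

        def forward(self, tokens):                                # tokens: (batch, 3), e.g. [a, b, '=']
            x = self.embed(tokens) + self.pos
            attn_out, _ = self.attn(x, x, x, need_weights=False)
            x = x + attn_out
            x = x + self.mlp(x)
            return self.unembed(x[:, -1])                         # read the answer off the final position

    model = OneLayerTransformer()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    # Full-batch gradient descent: one optimizer step per epoch over the entire
    # training set, repeated for 40,000 epochs.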