Towards Understanding Grokking: An Effective Theory of Representation Learning

Authors: Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, Mike Williams

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present both a microscopic analysis anchored by an effective theory and a macroscopic analysis of phase diagrams describing learning performance across hyperparameters. We find that generalization originates from structured representations whose training dynamics and dependence on training set size can be predicted by our effective theory in a toy setting. We observe empirically the presence of four learning phases: comprehension, grokking, memorization, and confusion.
Researcher Affiliation | Academia | Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams; Department of Physics, Institute for AI and Fundamental Interactions, MIT; {zmliu,kitouni,nnolte,ericjm,tegmark,mwill}@mit.edu
Pseudocode | No | The paper presents theoretical propositions and mathematical formulations (e.g., equations for the effective loss), but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Project code can be found at: https://github.com/ejmichaud/grokking-squared
Open Datasets Yes Dataset In our toy setting, we are concerned with learning the addition operation... If i, j {0, . . . , p 1}, there are in total p(p + 1)/2 different samples... We denote the full dataset as D0 and split it into a training dataset D and a validation dataset D , i.e., D S D = D0, D T D = .
Dataset Splits | Yes | For training/validation splitting, we choose 45/10 for non-modular addition (p = 10) and 24/12 for the permutation group S3. We denote the full dataset as D0 and split it into a training dataset D and a validation dataset D′, i.e., D ∪ D′ = D0, D ∩ D′ = ∅. (See the dataset-construction sketch below the table.)
Hardware Specification | Yes | All experiments were run on a workstation with two NVIDIA A6000 GPUs within a few days.
Software Dependencies | No | The paper mentions optimizers like 'Adam' and 'AdamW' but does not specify version numbers for any software components, libraries, or programming languages used.
Experiment Setup | Yes | For the 1D embeddings, we use the Adam optimizer with the learning rate in [10^-5, 10^-2] and zero weight decay. For the decoder, we use an AdamW optimizer with the learning rate in [10^-5, 10^-2] and the weight decay in [0, 10] (regression) or [0, 20] (classification). For training/validation splitting, we choose 45/10 for non-modular addition (p = 10) and 24/12 for the permutation group S3. (See the optimizer-configuration sketch below the table.)
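
To make the Open Datasets and Dataset Splits rows concrete, here is a minimal dataset-construction sketch in plain Python. It assumes the non-modular addition task with p = 10 and unordered input pairs (i, j) with i <= j, which yields p(p + 1)/2 = 55 samples split into 45 training and 10 validation examples; the variable names and the random split below are illustrative and not taken from the released code.

```python
# Minimal sketch (assumptions: plain Python, hypothetical variable names).
# Inputs are unordered pairs (i, j) with i, j in {0, ..., p-1}; the label is i + j.
import random

p = 10  # size of the symbol vocabulary

# All unordered pairs (i, j) with i <= j: p(p+1)/2 = 55 samples for p = 10.
full_dataset = [((i, j), i + j) for i in range(p) for j in range(i, p)]
assert len(full_dataset) == p * (p + 1) // 2

# Random 45/10 split into training set D and validation set D',
# so that D ∪ D' = D0 and D ∩ D' = ∅.
random.seed(0)
random.shuffle(full_dataset)
train_set, val_set = full_dataset[:45], full_dataset[45:]
print(len(train_set), len(val_set))  # 45 10
```

For the permutation group S3 quoted above, the analogous construction would presumably enumerate all 6 x 6 = 36 ordered pairs of group elements (composition is non-commutative) and split them 24/12.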
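The Experiment Setup row describes two optimizers used side by side: plain Adam with zero weight decay for the 1D embeddings, and AdamW with weight decay for the decoder. Below is a minimal PyTorch sketch of that configuration; the decoder architecture and the specific learning-rate and weight-decay values are illustrative placeholders drawn from the reported search ranges, not the paper's chosen settings.

```python
# Minimal sketch (assumptions: PyTorch; toy decoder architecture is hypothetical).
import torch
import torch.nn as nn

p = 10

# 1D learnable embedding for each of the p symbols.
embedding = nn.Embedding(p, 1)

# Illustrative decoder: maps the summed 1D embeddings to a scalar prediction
# (regression variant); the paper's exact decoder architecture is not reproduced here.
decoder = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))

# Embeddings: Adam with learning rate in [1e-5, 1e-2] and zero weight decay.
opt_embed = torch.optim.Adam(embedding.parameters(), lr=1e-3, weight_decay=0.0)

# Decoder: AdamW with learning rate in [1e-5, 1e-2] and weight decay in
# [0, 10] for regression (or [0, 20] for classification).
opt_decoder = torch.optim.AdamW(decoder.parameters(), lr=1e-3, weight_decay=1.0)

# One illustrative training step: predict i + j from the sum of the two embeddings.
i, j = torch.tensor([3]), torch.tensor([5])
pred = decoder(embedding(i) + embedding(j)).squeeze()
loss = (pred - (i + j).float()).pow(2).mean()
loss.backward()
opt_embed.step()
opt_decoder.step()
```

Keeping the embeddings free of weight decay while decaying only the decoder parameters mirrors the quoted ranges, where regularization is searched over for the decoder alone.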