Towards Understanding Grokking: An Effective Theory of Representation Learning
Authors: Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, Mike Williams
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present both a microscopic analysis anchored by an effective theory and a macroscopic analysis of phase diagrams describing learning performance across hyperparameters. We find that generalization originates from structured representations whose training dynamics and dependence on training set size can be predicted by our effective theory in a toy setting. We observe empirically the presence of four learning phases: comprehension, grokking, memorization, and confusion. |
| Researcher Affiliation | Academia | Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams; Department of Physics, Institute for AI and Fundamental Interactions, MIT; {zmliu,kitouni,nnolte,ericjm,tegmark,mwill}@mit.edu |
| Pseudocode | No | The paper presents theoretical propositions and mathematical formulations (e.g., equations for effective loss), but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project code can be found at: https://github.com/ejmichaud/grokking-squared |
| Open Datasets | Yes | Dataset: In our toy setting, we are concerned with learning the addition operation... If i, j ∈ {0, ..., p − 1}, there are in total p(p + 1)/2 different samples... We denote the full dataset as D0 and split it into a training dataset D and a validation dataset D′, i.e., D ∪ D′ = D0, D ∩ D′ = ∅. |
| Dataset Splits | Yes | For training/validation splitting, we choose 45/10 for non-modular addition (p = 10) and 24/12 for the permutation group S3. We denote the full dataset as D0 and split it into a training dataset D and a validation dataset D′, i.e., D ∪ D′ = D0, D ∩ D′ = ∅. (See the dataset-construction sketch below the table.) |
| Hardware Specification | Yes | All experiments were run on a workstation with two NVIDIA A6000 GPUs within a few days. |
| Software Dependencies | No | The paper mentions optimizers like 'Adam' and 'AdamW' but does not specify version numbers for any software components, libraries, or programming languages used. |
| Experiment Setup | Yes | For the 1D embeddings, we use the Adam optimizer with learning rate in [10^-5, 10^-2] and zero weight decay. For the decoder, we use an AdamW optimizer with the learning rate in [10^-5, 10^-2] and the weight decay in [0, 10] (regression) or [0, 20] (classification). For training/validation splitting, we choose 45/10 for non-modular addition (p = 10) and 24/12 for the permutation group S3. (See the optimizer sketch below the table.) |
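A minimal sketch of the dataset construction quoted in the Open Datasets and Dataset Splits rows, assuming the non-modular addition task with p = 10 and an unordered-pair convention (which yields p(p + 1)/2 = 55 samples, split 45/10). The random seed is a hypothetical placeholder; the paper does not specify one.

```python
import itertools
import random

p = 10  # non-modular addition over symbols {0, ..., p-1}

# Unordered pairs (i, j) with i <= j give p(p + 1)/2 = 55 samples in total.
full_dataset = [((i, j), i + j)
                for i, j in itertools.combinations_with_replacement(range(p), 2)]
assert len(full_dataset) == p * (p + 1) // 2

# Split into a training set D (45 samples) and a validation set D'
# (10 samples), so that D ∪ D' = D0 and D ∩ D' = ∅.
random.seed(0)  # hypothetical seed, not specified in the paper
random.shuffle(full_dataset)
train_set, val_set = full_dataset[:45], full_dataset[45:]
```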
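A hedged PyTorch sketch of the optimizer configuration described in the Experiment Setup row. The module names (`embedding`, `decoder`), layer sizes, and the specific learning-rate and weight-decay values are illustrative placeholders chosen from the quoted ranges, not the authors' exact code.

```python
import torch
import torch.nn as nn

# Illustrative toy model: 1D learnable embeddings followed by an MLP decoder.
p = 10
embedding = nn.Embedding(p, 1)  # 1D embedding for each of the p input symbols
decoder = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

# Adam for the embeddings: zero weight decay, lr drawn from [1e-5, 1e-2].
embedding_opt = torch.optim.Adam(embedding.parameters(), lr=1e-3, weight_decay=0.0)

# AdamW for the decoder: lr in [1e-5, 1e-2], weight decay in [0, 10] (regression).
decoder_opt = torch.optim.AdamW(decoder.parameters(), lr=1e-3, weight_decay=1.0)
```

Keeping the embeddings free of weight decay while regularizing only the decoder mirrors the separation the paper describes between representation learning and decoding.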