Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking

Authors: Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon Shaolei Du, Jason D. Lee, Wei Hu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Excerpts from the paper: 'This paper studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases. Specifically, when training homogeneous neural nets with large initialization and small weight decay on both classification and regression tasks, we prove that the training process gets trapped...'; 'Our Contributions. In this work, we address these limitations by identifying simple yet insightful theoretical setups where grokking with sharp transition can be rigorously proved and its mechanism can be intuitively understood...'; 'MOTIVATING EXPERIMENT: GROKKING IN MODULAR ADDITION. In this section, we provide experiments to show that the initialization scale and weight decay are important factors in the grokking phenomenon...'; 'Empirical Validation: Grokking.'
Researcher Affiliation | Academia | Kaifeng Lyu, Princeton University, klyu@cs.princeton.edu; Jikai Jin, Stanford University, jkjin@stanford.edu; Zhiyuan Li, Toyota Technological Institute at Chicago, zhiyuanli@ttic.edu; Simon S. Du, University of Washington, ssdu@cs.washington.edu; Jason D. Lee, Princeton University, jasonlee@princeton.edu; Wei Hu, University of Michigan, vvh@umich.edu
Pseudocode | No | The paper contains mathematical equations and theoretical derivations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code is available at https://github.com/vfleaking/grokking-dichotomy
Open Datasets | No | The paper uses custom-generated datasets for modular addition, sparse linear classification, and matrix completion. For modular addition: 'randomly split {(a, b, c) : a + b ≡ c (mod p)} into training and test sets, and train a neural net on the training set to predict c given input pair (a, b).' For sparse linear classification: 'sample n data points uniformly from {±1}^d...'. For matrix completion: 'randomly choose 5% of the entries as the training set.' (A sketch regenerating these three datasets appears after this table.)
Dataset Splits | No | The paper mentions splitting data into 'training and test sets' and specifies the 'training data size as 40% of the total number of data', but it does not provide details on a validation set or its split. (The 40% train / 60% test split is included in the data sketch after this table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing resources used for running the experiments. It only mentions the types of neural networks trained and the optimizers used.
Software Dependencies | No | The paper mentions using a 'two-layer ReLU net with full-batch GD' and refers to optimizers like 'Adam, AdamW', but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | The paper provides specific experimental setup details, including: 'width h as 1024, the learning rate as 0.002 and weight decay as 10^-4', 'large initialization with initial parameter norm 128, and small weight decay λ = 0.001', and 'initialization scale α = 10, learning rate η = 0.1, weight decay 10^-4'. (A minimal training sketch assembling such a configuration follows this table.)
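
The custom datasets described in the Open Datasets and Dataset Splits rows are easy to regenerate from the quoted descriptions. The NumPy sketch below is only an illustration under our own assumptions: the modulus p, the sizes n and d, and the rank of the ground-truth matrix are placeholder values, while the 40% training fraction and the 5% observation rate are taken from the quotes above. It is not the authors' released code (see the repository linked in the Open Source Code row for that).

```python
import numpy as np

rng = np.random.default_rng(0)

# Modular addition: all pairs (a, b) labeled c = (a + b) mod p,
# randomly split into 40% training and 60% test data (no validation set).
p = 97  # placeholder modulus
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = rng.permutation(len(pairs))
n_train = int(0.4 * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Sparse linear classification: n points sampled uniformly from {±1}^d.
n, d = 200, 30  # placeholder sizes
X = rng.choice([-1.0, 1.0], size=(n, d))

# Matrix completion: observe a random 5% of the entries of a low-rank matrix.
ground_truth = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 50))  # rank-3 example
observed_mask = rng.random(ground_truth.shape) < 0.05
```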
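
The Experiment Setup row quotes a width-1024 two-layer ReLU network, a large initialization, and a small weight decay, which is the combination the paper argues induces grokking. The PyTorch sketch below shows one way such a configuration can be assembled and how grokking is usually observed: training accuracy saturates early while test accuracy jumps only much later. The framework, the rescale-to-a-target-norm step, and all variable names are our assumptions, and the quoted hyperparameters come from different experiments in the paper, so combining them here is purely illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One-hot encoded modular-addition data with a 40% / 60% train/test split.
p = 97
a, b = torch.meshgrid(torch.arange(p), torch.arange(p), indexing="ij")
a, b = a.reshape(-1), b.reshape(-1)
X = torch.cat([nn.functional.one_hot(a, p), nn.functional.one_hot(b, p)], dim=1).float()
y = (a + b) % p
perm = torch.randperm(len(y))
n_train = int(0.4 * len(y))
tr, te = perm[:n_train], perm[n_train:]

# Two-layer ReLU network of width h = 1024.
model = nn.Sequential(nn.Linear(2 * p, 1024), nn.ReLU(), nn.Linear(1024, p))

# "Large initialization": rescale the default init so the overall parameter
# norm reaches a target value (128 in one of the quoted setups).
with torch.no_grad():
    init_norm = torch.sqrt(sum(w.pow(2).sum() for w in model.parameters()))
    for w in model.parameters():
        w.mul_(128.0 / init_norm)

# Full-batch gradient descent with a small weight decay.
opt = torch.optim.SGD(model.parameters(), lr=2e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(200_000):  # the delayed test-accuracy jump needs many steps
    opt.zero_grad()
    loss_fn(model(X[tr]), y[tr]).backward()
    opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            train_acc = (model(X[tr]).argmax(1) == y[tr]).float().mean().item()
            test_acc = (model(X[te]).argmax(1) == y[te]).float().mean().item()
        print(f"step {step}: train acc {train_acc:.3f}, test acc {test_acc:.3f}")
```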