Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking
Authors: Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon Shaolei Du, Jason D. Lee, Wei Hu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper combines theoretical analysis with supporting experiments. It studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases: 'when training homogeneous neural nets with large initialization and small weight decay on both classification and regression tasks, we prove that the training process gets trapped...' The contributions state: 'In this work, we address these limitations by identifying simple yet insightful theoretical setups where grokking with sharp transition can be rigorously proved and its mechanism can be intuitively understood.' Experiments appear in the section 'Motivating Experiment: Grokking in Modular Addition' ('In this section, we provide experiments to show that the initialization scale and weight decay are important factors in the grokking phenomenon') and in 'Empirical Validation: Grokking' paragraphs. |
| Researcher Affiliation | Academia | Kaifeng Lyu, Princeton University (klyu@cs.princeton.edu); Jikai Jin, Stanford University (jkjin@stanford.edu); Zhiyuan Li, Toyota Technological Institute at Chicago (zhiyuanli@ttic.edu); Simon S. Du, University of Washington (ssdu@cs.washington.edu); Jason D. Lee, Princeton University (jasonlee@princeton.edu); Wei Hu, University of Michigan (vvh@umich.edu) |
| Pseudocode | No | The paper contains mathematical equations and theoretical derivations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available at https://github.com/vfleaking/grokking-dichotomy |
| Open Datasets | No | The paper uses custom-generated datasets for modular addition, sparse linear classification, and matrix completion. For modular addition: 'randomly split {(a, b, c) : a + b ≡ c (mod p)} into training and test sets, and train a neural net on the training set to predict c given input pair (a, b)' (see the data-generation sketch after this table). For sparse linear classification: 'sample n data points uniformly from {±1}^d...'. For matrix completion: 'randomly choose 5% of the entries as the training set.' |
| Dataset Splits | No | The paper mentions splitting data into 'training and test sets' and specifies the 'training data size as 40% of the total number of data', but it does not provide details on a validation set or its split. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing resources used for running the experiments. It only mentions the types of neural networks trained and optimizers used. |
| Software Dependencies | No | The paper mentions using a 'two-layer ReLU net with full-batch GD' and refers to optimizers like 'Adam, AdamW', but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The paper provides specific experimental setup details, including: 'width h as 1024, the learning rate as 0.002 and weight decay as 10^{-4}', 'large initialization with initial parameter norm 128, and small weight decay λ = 0.001', and 'initialization scale α = 10, learning rate η = 0.1, weight decay 10^{-4}' (see the training-setup sketch after this table). |
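
To make the modular-addition data description in the Open Datasets row concrete, here is a minimal sketch of how such a dataset could be generated and split. The modulus p = 97 and the RNG seed are illustrative assumptions not taken from the table; the 40% train fraction is the one reported in the Dataset Splits row.

```python
import itertools
import random

def make_modular_addition_data(p=97, train_frac=0.4, seed=0):
    """Enumerate all triples (a, b, c) with a + b ≡ c (mod p) and randomly
    split them into train/test sets, as quoted in the Open Datasets row.
    p = 97 and seed are illustrative assumptions; train_frac = 0.4 matches
    the 'training data size as 40%' quoted in the Dataset Splits row."""
    triples = [(a, b, (a + b) % p)
               for a, b in itertools.product(range(p), repeat=2)]
    rng = random.Random(seed)
    rng.shuffle(triples)
    n_train = int(train_frac * len(triples))
    return triples[:n_train], triples[n_train:]

train_set, test_set = make_modular_addition_data()
print(len(train_set), len(test_set))  # 3763 / 5646 for p = 97
```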
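
The hyperparameters quoted in the Experiment Setup row can be assembled into a training configuration along the following lines. This is a sketch, not the authors' released code (which is linked in the Open Source Code row): the one-hot input encoding and the choice of AdamW are assumptions, while the width 1024, learning rate 0.002, initial parameter norm 128, and weight decay λ = 0.001 are the values quoted above; combining them into a single run is also an assumption.

```python
import torch
import torch.nn as nn

p, width = 97, 1024  # width h = 1024 as quoted; p = 97 is an assumption

# Two-layer ReLU net over a one-hot encoding of the input pair (a, b).
model = nn.Sequential(
    nn.Linear(2 * p, width),
    nn.ReLU(),
    nn.Linear(width, p),  # logits over the residue c
)

# Rescale all parameters so the initial parameter norm matches the quoted
# value of 128 (the "large initialization" regime said to induce grokking).
target_norm = 128.0
with torch.no_grad():
    init_norm = torch.sqrt(sum((w ** 2).sum() for w in model.parameters()))
    for w in model.parameters():
        w.mul_(target_norm / init_norm)

# Small weight decay λ = 0.001, quoted alongside the large initialization;
# AdamW is an assumed optimizer (the paper also mentions full-batch GD and Adam).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3, weight_decay=1e-3)
```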