Grokking as the transition from lazy to rich training dynamics
Authors: Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, Cengiz Pehlevan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the simple setting of vanilla gradient descent on a polynomial regression problem with a two layer neural network which exhibits grokking without regularization in a way that cannot be explained by existing theories. We conclude with evidence that this transition from lazy (linear model) to rich training (feature learning) can control grokking in more general settings, like on MNIST, one-layer Transformers, and student-teacher networks. |
| Researcher Affiliation | Academia | Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, Cengiz Pehlevan (Harvard University) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | We conclude with evidence that this transition from lazy (linear model) to rich training (feature learning) can control grokking in more general settings, like on MNIST, one-layer Transformers, and student-teacher networks. We use 90% of all p^2 possible pairs for training, and the rest for the test set; the learning rate is η = 100. |
| Dataset Splits | No | The paper frequently refers to 'train loss' and 'test loss' but does not explicitly mention a 'validation set' or provide details on how a validation split was used for hyperparameter tuning or early stopping. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory amounts) are mentioned. The paper discusses experiments but does not specify the computational resources used. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library names like PyTorch, TensorFlow, or specific Python versions). It mentions using the AdamW optimizer but gives no version information. |
| Experiment Setup | Yes | We use 90% of all p^2 possible pairs for training, and the rest for the test set; the learning rate is η = 100. Crucially, we do not use any weight decay, and merely use vanilla gradient descent. (See the illustrative sketches below this table.) |
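
To make the quoted setting more concrete, here is a minimal sketch of full-batch vanilla gradient descent on a two-layer ReLU network fit to a polynomial regression target, with no weight decay and no adaptive optimizer. The input dimension, width, output scale, target polynomial, learning rate, and step count below are illustrative assumptions rather than values from the paper (which quotes η = 100 for its own parameterization).

```python
# Hypothetical sketch: vanilla full-batch GD on a two-layer ReLU network
# for a polynomial regression target. No weight decay, no momentum, no Adam.
import numpy as np

rng = np.random.default_rng(0)

d, N = 8, 512                 # input dimension and hidden width (assumed)
n_train, n_test = 200, 1000   # sample sizes (assumed)
alpha = 0.1                   # small output scale keeps early training near the lazy (linear) regime
lr = 1.0                      # illustrative step size; the paper quotes eta = 100 for its parameterization
steps = 20_000

# Assumed polynomial target: y = x_1 * x_2
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train = X_train[:, 0] * X_train[:, 1]
y_test = X_test[:, 0] * X_test[:, 1]

# Two-layer network: f(x) = (alpha / sqrt(N)) * a . relu(W x)
W = rng.standard_normal((N, d)) / np.sqrt(d)
a = rng.standard_normal(N)

def forward(X):
    h = np.maximum(X @ W.T, 0.0)          # (n, N) hidden activations
    return alpha / np.sqrt(N) * h @ a, h

def mse(pred, y):
    return 0.5 * np.mean((pred - y) ** 2)

for step in range(steps + 1):
    pred, h = forward(X_train)
    err = pred - y_train                  # residual of the MSE loss
    # Exact MSE gradients; note there is no weight-decay term anywhere.
    grad_a = alpha / np.sqrt(N) * h.T @ err / n_train
    grad_W = alpha / np.sqrt(N) * a[:, None] * (((h > 0) * err[:, None]).T @ X_train) / n_train
    a -= lr * grad_a
    W -= lr * grad_W
    if step % 5000 == 0:
        print(f"step {step:6d}  train {mse(pred, y_train):.4f}  test {mse(forward(X_test)[0], y_test):.4f}")
```

The "90% of all p^2 possible pairs" quote describes a split over every input pair of a modular arithmetic task. Below is a minimal sketch of that split, assuming modular addition and a hypothetical modulus p = 97; the task, modulus, and random seed are illustrative.

```python
# Hypothetical sketch of the "90% of all p^2 pairs" split for a modular
# arithmetic task (modular addition is assumed; p = 97 is illustrative).
import numpy as np

p = 97
rng = np.random.default_rng(0)

# Enumerate all p^2 input pairs (a, b) and their labels (a + b) mod p.
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

# 90% of the p^2 pairs for training, the remaining 10% for the test set.
perm = rng.permutation(p * p)
n_train = int(0.9 * p * p)
train_idx, test_idx = perm[:n_train], perm[n_train:]

X_train, y_train = pairs[train_idx], labels[train_idx]
X_test, y_test = pairs[test_idx], labels[test_idx]
print(len(X_train), len(X_test))   # 8468 training pairs, 941 test pairs for p = 97
```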