Deep Networks Always Grok and Here is Why

Authors: Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that grokking is actually much more widespread and materializes in a wide range of practical settings, such as training of a convolutional neural network (CNN) on CIFAR10 or a ResNet on Imagenette.
Researcher Affiliation | Academia | Rice University; Brown University. Correspondence to: Ahmed Imtiaz Humayun <imtiaz@rice.edu>.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper provides the web link 'bit.ly/grok-adversarial' but does not explicitly state that it hosts the open-source code for the described methodology; it is a general web link rather than a direct repository link with an explicit code-release statement.
Open Datasets | Yes | We make this observation for a number of training settings including for fully connected networks trained on MNIST (Figure 2), Convolutional Neural Networks (CNNs) trained on CIFAR10 and CIFAR100 (Figure 6), ResNet18 without batch-normalization, trained on CIFAR10 (Figure 1) and Imagenette (Figure 6), and a GPT-based Architecture trained on Shakespeare Text (Figure 9).
Dataset Splits | Yes | For all experiments we sample 1024 train, test, and random points for local complexity (LC) computation, except for the MNIST experiments, where we use 1000 training points (all of the training set where applicable) and 10000 test and random points for LC computation. (See the sampling sketch after the table.)
Hardware Specification | Yes | Computing LC for 1000 samples takes approx. 28s on an RTX 8000.
Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, CUDA versions) were mentioned in the paper.
Experiment Setup | Yes | We generate adversarial examples after each training step using ℓ∞-PGD with varying ϵ ∈ {0.03, 0.06, 0.10, 0.13, 0.16, 0.20}, α = 0.0156, and 10 (100 for MNIST) PGD steps. For training, we use the Adam optimizer and a weight decay of 0 for all experiments except the MNIST-MLP experiments, where we use a weight decay of 0.01. Unless specified, we use CNNs with 5 convolutional layers and two linear layers. For the ResNet18 experiments with CIFAR10, we use a pre-activation architecture with width 16. For the Imagenette experiments, we use the standard torchvision ResNet architecture. For all settings we do not use Batch Normalization, as reasoned in Appendix B. (See the PGD and architecture sketches after the table.)
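
The Dataset Splits row describes three point sets used for local complexity (LC) computation. Below is a minimal PyTorch sketch of how such sets could be assembled; the choice of CIFAR10 and the use of uniform noise for the "random" points are assumptions, since the paper does not provide code for this step.

```python
# Sketch: assembling train / test / random point sets for LC evaluation,
# following the sample counts quoted above. Uniform noise for "random"
# points is an assumption, not the authors' released code.
import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=to_tensor)
test_set = datasets.CIFAR10("./data", train=False, download=True, transform=to_tensor)

def sample_points(dataset, n):
    """Draw n images from a dataset without replacement."""
    idx = torch.randperm(len(dataset))[:n]
    return torch.stack([dataset[i][0] for i in idx.tolist()])

n_points = 1024                                 # 1000 / 10000 in the MNIST setting
train_pts = sample_points(train_set, n_points)  # points near the training data
test_pts = sample_points(test_set, n_points)    # points near the test data
random_pts = torch.rand(n_points, 3, 32, 32)    # assumed: uniform-noise "random" points
```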
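The Experiment Setup row describes generating adversarial examples after each training step. The sketch below shows one way to realize an ℓ∞-PGD attack with the quoted hyperparameters (ϵ values, α = 0.0156, 10 steps) interleaved with Adam updates; the placeholder model, the random start, and the loop structure are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def linf_pgd(model, x, y, eps, alpha=0.0156, steps=10):
    """Sketch of an l-infinity PGD attack matching the quoted hyperparameters
    (use steps=100 for the MNIST setting)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()  # random start (assumed)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()          # gradient ascent step
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps) # project onto the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

# Placeholder model; weight decay 0 matches the quoted non-MNIST settings
# (0.01 for the MNIST MLP).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.0)

def train_and_attack(loader, eps_list=(0.03, 0.06, 0.10, 0.13, 0.16, 0.20)):
    for x, y in loader:
        optimizer.zero_grad()
        F.cross_entropy(model(x), y).backward()
        optimizer.step()
        # Generate adversarial examples after each training step, one set per epsilon;
        # adversarial accuracy would be evaluated on `adv` here.
        adv = {eps: linf_pgd(model, x, y, eps) for eps in eps_list}
```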
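The quoted CNN description (5 convolutional layers followed by two linear layers, no batch normalization) could be instantiated as in the following sketch; the channel width, kernel size, and pooling choices are assumptions not taken from the paper.

```python
import torch.nn as nn

def make_cnn(num_classes=10, width=64):
    """Assumed 5-conv + 2-linear CNN without batch normalization."""
    layers, in_ch = [], 3
    for _ in range(5):
        layers += [nn.Conv2d(in_ch, width, kernel_size=3, padding=1), nn.ReLU()]
        in_ch = width
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.Linear(width, width), nn.ReLU(),   # first linear layer
               nn.Linear(width, num_classes)]        # second linear layer (classifier)
    return nn.Sequential(*layers)
```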