Deep Networks Always Grok and Here is Why
Authors: Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that grokking is actually much more widespread and materializes in a wide range of practical settings, such as training of a convolutional neural network (CNN) on CIFAR10 or a ResNet on Imagenette. |
| Researcher Affiliation | Academia | 1Rice University 2Brown University. Correspondence to: Ahmed Imtiaz Humayun <imtiaz@rice.edu>. |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper provides a web link 'bit.ly/grok-adversarial' but does not explicitly state that it hosts the open-source code for the described methodology. It's a general web link, not a direct repository link with an explicit code release statement. |
| Open Datasets | Yes | We make this observation for a number of training settings including for fully connected networks trained on MNIST (Figure 2), Convolutional Neural Networks (CNNs) trained on CIFAR10 and CIFAR100 (Figure 6), ResNet18 without batch-normalization, trained on CIFAR10 (Figure 1) and Imagenette (Figure 6), and a GPT-based architecture trained on Shakespeare text (Figure 9). |
| Dataset Splits | Yes | For all experiments we sample 1024 train, test, and random points for local complexity (LC) computation, except for the MNIST experiments, where we use 1000 training points (all of the training set where applicable) and 10000 test and random points for LC computation. (A hedged sampling sketch is given below the table.) |
| Hardware Specification | Yes | Computing LC for 1000 samples takes approx. 28s on an RTX 8000. |
| Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, CUDA versions) were mentioned in the paper. |
| Experiment Setup | Yes | We generate adversarial examples after each training step using ℓ∞-PGD with varying ε ∈ {0.03, 0.06, 0.10, 0.13, 0.16, 0.20}, α = 0.0156, and 10 (100 for MNIST) PGD steps. For training, we use the Adam optimizer and a weight decay of 0 for all experiments except the MNIST-MLP experiments, where we use a weight decay of 0.01. Unless specified, we use CNNs with 5 convolutional layers and two linear layers. For the ResNet18 experiments with CIFAR10, we use a pre-activation architecture with width 16. For the Imagenette experiments, we use the standard torchvision ResNet architecture. For all settings we do not use Batch Normalization, as reasoned in Appendix B. (A hedged PGD/training sketch is given below the table.) |
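
The Dataset Splits row describes a sampling protocol rather than a conventional split: a fixed number of train, test, and random points are drawn for local-complexity (LC) computation. The sketch below shows one way that sampling could look in PyTorch; the `sample_lc_points` helper, the uniform-noise choice for the "random" points, the `(3, 32, 32)` input shape, and the assumption that the datasets return tensors are illustrative and not confirmed by the paper.

```python
# Hypothetical sketch of the LC point-sampling protocol quoted above:
# 1024 train, 1024 test, and 1024 random points per experiment.
# The uniform-noise "random" points and the CIFAR-shaped default input
# are assumptions made for illustration only.
import torch
from torch.utils.data import Dataset


def sample_lc_points(train_set: Dataset, test_set: Dataset,
                     n_points: int = 1024, image_shape=(3, 32, 32)):
    """Draw n_points each from the train split, the test split, and random noise."""
    train_idx = torch.randperm(len(train_set))[:n_points].tolist()
    test_idx = torch.randperm(len(test_set))[:n_points].tolist()

    # Assumes each dataset item is (image_tensor, label), e.g. via ToTensor().
    train_pts = torch.stack([train_set[i][0] for i in train_idx])
    test_pts = torch.stack([test_set[i][0] for i in test_idx])

    # "Random" points: uniform noise over the input domain (assumed [0, 1]).
    random_pts = torch.rand(n_points, *image_shape)

    return train_pts, test_pts, random_pts
```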
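
The Experiment Setup row quotes concrete attack and optimizer hyperparameters. The following is a minimal, hedged sketch of how that configuration could be wired up in PyTorch, assuming ℓ∞-PGD with the quoted ε values, α = 0.0156, and 10 steps, plus Adam with zero weight decay. The function names, the [0, 1] clamping range, and the per-step robust-accuracy bookkeeping are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch of the quoted setup: l_inf PGD after each training step,
# eps in {0.03, ..., 0.20}, alpha = 0.0156, 10 PGD steps, Adam optimizer.
# Model and data loading are placeholders; clamping to [0, 1] is assumed.
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, eps=0.03, alpha=0.0156, steps=10):
    """Generate l_inf-bounded adversarial examples with PGD."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the eps-ball around x and the assumed [0, 1] range.
        x_adv = torch.clamp(x_adv, x - eps, x + eps).clamp(0.0, 1.0)
    return x_adv.detach()


def train_step(model, optimizer, x, y,
               eps_list=(0.03, 0.06, 0.10, 0.13, 0.16, 0.20)):
    """One training step followed by adversarial evaluation at each eps."""
    model.train()
    optimizer.zero_grad()
    F.cross_entropy(model(x), y).backward()
    optimizer.step()

    model.eval()
    robust_acc = {}
    for eps in eps_list:
        x_adv = pgd_attack(model, x, y, eps=eps)
        robust_acc[eps] = (model(x_adv).argmax(1) == y).float().mean().item()
    return robust_acc


# Optimizer as quoted: Adam with weight decay 0 (0.01 only for the MNIST-MLP runs).
# optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.0)
```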