Omnigrok: Grokking Beyond Algorithmic Data
Authors: Ziming Liu, Eric J. Michaud, Max Tegmark
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Guided by the intuitive picture, we are able to induce grokking on tasks involving images, language and molecules, although the grokking signals are sometimes less dramatic. We attribute the dramatic nature of grokking for algorithmic datasets to representation learning. ... To illustrate how the LU mechanism results in grokking, we employ a toy teacher-student setup. ... We now analyze loss landscapes and search for grokking for several more interesting datasets ... We report the main results here, with experiment details included in Appendix A. |
| Researcher Affiliation | Academia | Ziming Liu, Eric J. Michaud & Max Tegmark Department of Physics, Institute for AI and Fundamental Interactions, MIT {zmliu,ericjm,tegmark}@mit.edu |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/KindXiaoming/Omnigrok. |
| Open Datasets | Yes | We now analyze loss landscapes and search for grokking for several more interesting datasets... MNIST (Deng, 2012)... IMDb dataset (Maas et al., 2011)... QM9 dataset (Ramakrishnan et al., 2014). |
| Dataset Splits | No | The paper mentions 'training' and 'test' splits (e.g., 'We hold back 25% of the dataset for testing.', 'We split the dataset into 50/50 train/test.'), but does not explicitly mention a 'validation' split. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments (e.g., CPU, GPU models, memory). |
| Software Dependencies | No | The paper mentions 'PyTorch' and the 'Adam optimizer' but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | The student network is trained with the Adam optimizer (learning rate 3 × 10⁻⁴) for 10⁵ steps. ... We train width-200 depth-3 ReLU MLPs on the MNIST dataset with MSE loss. We use the AdamW optimizer with a learning rate of 0.001 and a batch size of 200. ... We use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001 to minimize the binary cross entropy loss. ... We use the Adam optimizer with learning rate 0.001 to minimize the MSE loss. |
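For concreteness, the MNIST configuration quoted in the Experiment Setup row can be written as a short PyTorch sketch. This is not the authors' released code (see the linked repository); it only illustrates the reported hyperparameters, and the exact layer layout of the "width-200 depth-3" MLP and the use of one-hot MSE targets are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Width-200, depth-3 ReLU MLP on MNIST, as described in the paper.
# "Depth-3" is read here as three weight layers (two hidden layers);
# the authors' exact architecture is in the linked repository.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),
)

# AdamW with learning rate 0.001 and batch size 200, per the quoted setup.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=200, shuffle=True)

for x, y in loader:
    # MSE loss against one-hot class targets (an assumed encoding).
    target = nn.functional.one_hot(y, num_classes=10).float()
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    optimizer.step()
    break  # single illustrative step; grokking experiments run far longer
```

Only one optimization step is shown; the grokking experiments run this loop for many steps while tracking training and test performance over time.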