Omnigrok: Grokking Beyond Algorithmic Data
Authors: Ziming Liu, Eric J. Michaud, Max Tegmark
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Guided by the intuitive picture, we are able to induce grokking on tasks involving images, language and molecules, although the grokking signals are sometimes less dramatic. We attribute the dramatic nature of grokking for algorithmic datasets to representation learning. ... To illustrate how the LU mechanism results in grokking, we employ a toy teacher-student setup. ... We now analyze loss landscapes and search for grokking for several more interesting datasets ... We report the main results here, with experiment details included in Appendix A. |
| Researcher Affiliation | Academia | Ziming Liu, Eric J. Michaud & Max Tegmark Department of Physics, Institute for AI and Fundamental Interactions, MIT {zmliu,ericjm,tegmark}@mit.edu |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/KindXiaoming/Omnigrok. |
| Open Datasets | Yes | We now analyze loss landscapes and search for grokking for several more interesting datasets... MNIST (Deng, 2012)... IMDb dataset (Maas et al., 2011)... QM9 dataset (Ramakrishnan et al., 2014). |
| Dataset Splits | No | The paper mentions 'training' and 'test' splits (e.g., 'We hold back 25% of the dataset for testing.', 'We split the dataset into 50/50 train/test.'), but does not explicitly mention a 'validation' split. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments (e.g., CPU, GPU models, memory). |
| Software Dependencies | No | The paper mentions 'PyTorch' and the 'Adam optimizer' but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | The student network is trained with the Adam optimizer (learning rate 3 × 10⁻⁴) for 10⁵ steps. ... We train width-200 depth-3 ReLU MLPs on the MNIST dataset with MSE loss. We use the AdamW optimizer with a learning rate of 0.001 and a batch size of 200. ... We use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001 to minimize the binary cross entropy loss. ... We use the Adam optimizer with learning rate 0.001 to minimize the MSE loss. |
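For concreteness, the MNIST configuration quoted in the Experiment Setup row can be written as a short PyTorch sketch. This is not the authors' released code (see the linked repository); it only illustrates the reported hyperparameters, and the exact layer layout of the "width-200 depth-3" MLP and the use of one-hot MSE targets are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Width-200, depth-3 ReLU MLP on MNIST, as described in the paper.
# "Depth-3" is read here as three weight layers (two hidden layers);
# the authors' exact architecture is in the linked repository.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),
)

# AdamW with learning rate 0.001 and batch size 200, per the quoted setup.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=200, shuffle=True)

for x, y in loader:
    # MSE loss against one-hot class targets (an assumed encoding).
    target = nn.functional.one_hot(y, num_classes=10).float()
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    optimizer.step()
    break  # single illustrative step; grokking experiments run far longer
```

Only one optimization step is shown; the grokking experiments run this loop for many steps while tracking training and test performance over time.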