Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity

Authors: Jack William Miller, Charles O'Neill, Thang D Bui

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we conduct an empirical exploration of grokking, uncovering new aspects of the phenomenon not explained by current theory. We begin by describing grokking and summarise its existing explanations. Afterwards, we present our empirical observations, most notably the existence of grokking outside of neural networks. Finally, we suggest a mechanism for grokking that is broadly consistent with our observations.
Researcher Affiliation | Academia | Jack Miller EMAIL ANU College of Engineering, Computing and Cybernetics; Charles O'Neill EMAIL ANU College of Engineering, Computing and Cybernetics; Thang Bui EMAIL ANU College of Engineering, Computing and Cybernetics
Pseudocode | Yes | The algorithm used to run the experiment is detailed in Algorithm 1 (Appendix I.1).
Open Source Code | Yes | All experiments can be found at this GitHub page. They have descriptive names and should reproduce the figures seen in this paper. For Figure 6, the relevant experiment is in the feat/info-theory-description branch.
Open Datasets | No | Many datasets were used for the experimentation completed in this paper. They were either found in Merrill et al. (2023) or Power et al. (2022), or were developed independently.
Dataset Splits | No | The paper does not explicitly state the training/test/validation splits (e.g., percentages or exact counts) used for the experiments. It refers to 'training points' and a 'validation dataset' but does not quantify the split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running its experiments.
Software Dependencies | No | The paper mentions optimisers such as 'Adam' and 'SGD', but it does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow) that would be needed to replicate the experiments.
Experiment Setup | Yes | For the model, we used a simple neural network analogous to that of Merrill et al. (2023). This neural network consisted of 1 hidden layer of size 1000 and was optimised using SGD with cross-entropy loss. The weight decay was set to 10^-2 and the learning rate to 10^-1. Loss plots for all experiments are shown in Appendix N.
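The Experiment Setup row pins down the reported hyperparameters: a 1-hidden-layer network of width 1000, SGD with cross-entropy loss, weight decay 10^-2, and learning rate 10^-1. The following is a minimal NumPy sketch of one such training step; it is not the authors' code, and the input dimension, class count, batch size, and random data are placeholder assumptions.

```python
import numpy as np

# Sketch of the stated setup: 1 hidden layer of width 1000, SGD with
# cross-entropy loss, weight decay 1e-2, learning rate 1e-1.
# in_dim, n_classes, and the batch below are hypothetical placeholders.
rng = np.random.default_rng(0)
in_dim, hidden, n_classes = 64, 1000, 10
lr, weight_decay = 1e-1, 1e-2

W1 = rng.normal(0, np.sqrt(2 / in_dim), (in_dim, hidden))
W2 = rng.normal(0, np.sqrt(2 / hidden), (hidden, n_classes))

x = rng.normal(size=(32, in_dim))
y = rng.integers(0, n_classes, size=32)

# Forward pass: ReLU hidden layer, softmax output, mean cross-entropy.
h = np.maximum(x @ W1, 0)
logits = h @ W2
logits -= logits.max(axis=1, keepdims=True)  # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(32), y]).mean()

# Backward pass: softmax/cross-entropy gradient, chained through the ReLU.
d_logits = probs.copy()
d_logits[np.arange(32), y] -= 1
d_logits /= 32
dW2 = h.T @ d_logits
dh = d_logits @ W2.T
dh[h <= 0] = 0
dW1 = x.T @ dh

# SGD update with L2 weight decay folded into the gradient.
W1 -= lr * (dW1 + weight_decay * W1)
W2 -= lr * (dW2 + weight_decay * W2)
```

The weight-decay term matters here: heavy regularisation relative to the learning rate is one of the knobs the grokking literature associates with delayed generalisation.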