Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Is Grokking a Computational Glass Relaxation?

Authors: Xiaotian Zhang, Yue Shang, Entao Yang, Ge Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments in transformers on arithmetic tasks suggest that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition [1].
Researcher Affiliation Collaboration (1) Department of Physics, City University of Hong Kong, Hong Kong, China; (2) Department of Physics and Astronomy, University of Pennsylvania, Philadelphia, PA, USA; (3) Innovation Campus Delaware, Air Liquide, Newark, DE, USA
Pseudocode Yes The overall WanD training procedure is summarized in Algorithm 1. ... The detailed procedure for obtaining smoothed accuracy is outlined in Algorithm 2. ... The overall WLMD entropy sampling procedure is outlined in Algorithm 3.
Open Source Code Yes Code is available at https://github.com/xtzhang28/Grokking.
Open Datasets No The data consists of two parts: prompts, built from x, y, and p, and answers x ∘ y. In the three tasks we study, the answers are determined by the following equations: x ∘ y = x + y mod p for 0 ≤ x, y < p; x ∘ y = x² + y mod p for 0 ≤ x, y < p; x ∘ y = x³ + xy² + y mod p for 0 ≤ x, y < p. Here, x, y, and p are all natural numbers. In this paper, p is set to the prime number 67. The entire dataset will be divided into a training set and a test set with a 50% fraction, by a fixed random seed. In Appendix Figure 4, we provide a visual image of the full dataset.
Dataset Splits Yes The entire dataset will be divided into a training set and a test set with a 50% fraction, by a fixed random seed.
Hardware Specification Yes We ran 8 WLMD processes on 4 GeForce RTX 4090 GPUs, requiring approximately 100 GPU hours per 1M epochs. Additional computational details are provided in Appendix A.2. ... For each entropy sampling task, we use 4 NVIDIA H100 GPUs to run 8 processes, totaling 60M epochs and consuming nearly 4,000 GPU hours.
Software Dependencies No The paper mentions GPUs like GeForce RTX 4090 and NVIDIA H100, which implies the use of related software like CUDA/cuDNN and deep learning frameworks such as PyTorch or TensorFlow. However, no specific version numbers for these software dependencies are provided in the paper.
Experiment Setup Yes We use the same one-layer transformer model for the programs of AdamW optimizer training, WanD optimizer training, and WLMD entropy sampling. This transformer model has 4 attention heads by default, each with a width of 32, yielding a model dimension of 128, and attached with an MLP layer with a width of 512. In the experiment of eliminating grokking by controlling the weight norm, we modified the attention head width (8/16/32) and the MLP layer width (128/256/512). ... The hyperparameters for training with AdamW are given in Table 1. ...
Table 1: AdamW training hyperparameters
Operation | Model Dimension | MLP Width | Learning Rate | Weight Decay | Weight Norm
x + y | 32 | 128 | 1e-3 | 0 | Fixed at 30
x + y | 64 | 256 | 1e-3 | 0 | Fixed at 30
x + y | 128 | 512 | 1e-3 | 0 | Fixed at 30
x + y | 128 | 512 | 3e-3 | 1 | Not fixed
x² + y | 128 | 512 | 5e-3 | 1 | Not fixed
x³ + xy² + y | 128 | 512 | 1e-2 | 1 | Not fixed
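
The dataset described in the Open Datasets and Dataset Splits rows is fully determined by the three equations, the modulus p = 67, and a seeded 50% split, so it can be regenerated rather than downloaded. The following is a minimal sketch of that construction, not the authors' code: the names `TASKS`, `build_dataset`, and `split_dataset` are hypothetical, the seed value 0 is an assumption (the summary only says the seed is fixed), and the prompt encoding used by the paper's tokenizer is not reproduced here.

```python
import random

P = 67  # prime modulus stated in the Open Datasets row

# The three operations listed in the Open Datasets row, each taken mod p.
TASKS = {
    "x + y": lambda x, y, p: (x + y) % p,
    "x^2 + y": lambda x, y, p: (x * x + y) % p,
    "x^3 + x*y^2 + y": lambda x, y, p: (x ** 3 + x * y * y + y) % p,
}


def build_dataset(task, p=P):
    """Enumerate every pair (x, y) with 0 <= x, y < p together with its answer."""
    op = TASKS[task]
    return [((x, y), op(x, y, p)) for x in range(p) for y in range(p)]


def split_dataset(examples, train_fraction=0.5, seed=0):
    """Shuffle with a fixed seed, then cut into train and test parts.

    seed=0 is an assumption; the source does not state the actual seed.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    data = build_dataset("x + y")
    train, test = split_dataset(data)
    print(len(data), len(train), len(test))  # prints: 4489 2244 2245
```

Because 67 × 67 = 4489 is odd, a 50% split cannot be exact; this sketch rounds the training set down. How the paper handles the odd example is not stated in the summary.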