Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Grokking Beyond the Euclidean Norm of Model Parameters
Authors: Tikeng Notsawo Pascal Junior, Guillaume Dumas, Guillaume Rabusseau
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Building upon this theoretical foundation, we validate its implications both theoretically and empirically across various settings: sparsity (Theorem 3.1) and low-rankness (Theorem 3.4). For sparsity, we focus on a linear teacher-student setup and show that recovery of sparse vectors using gradient descent and Lasso exhibits a grokking phenomenon, which is impossible using only ℓ2 regularization, regardless of the initialization scale, as advocated by previous art (Lyu et al., 2023; Liu et al., 2023b). Moreover, we empirically show that in deep linear networks, the sparse/low-rank structure of the data is enough to have generalization without explicit regularization. Adding depth makes it possible to grok or ungrok simply from the implicit regularization of gradient descent. We demonstrate this on a nonlinear teacher-student setup, on the algorithmic data setup where grokking was first observed (Power et al., 2022), and on image classification tasks. |
| Researcher Affiliation | Academia | 1Université de Montréal, Montréal, Québec, Canada 2Mila, Québec AI Institute, Montréal, Québec, Canada 3CHU Sainte-Justine Research Center, Montréal, Québec, Canada 4CIFAR AI Chair. |
| Pseudocode | No | The paper describes mathematical update rules and theoretical proofs, such as 'x(t+1) = x(t) − α(∇G(x(t)) + β∇H(x(t))), t ≥ 0' (Eq. 1), but does not contain clearly labeled pseudocode or algorithm blocks. The methods are described through equations and theoretical formulations. |
| Open Source Code | Yes | Our contributions can be summarized as follows1: (i) We show that grokking can be induced by the interplay between the sparse/low-rank structure of the solution and the ℓ1/ℓ∗ regularization used during training, extending previous results on ℓ2 regularization (Lyu et al., 2023). Our theoretical results extend beyond these specific regularizations, as we characterize the relationship between grokking time, regularization strength, and learning rate in a general setting. 1Code to reproduce our experiments: https://github.com/Tikquuss/grokking_beyong_l2_norm. |
| Open Datasets | Yes | We demonstrate this on a nonlinear teacher-student setup, on the algorithmic data setup where grokking was first observed (Power et al., 2022), and on image classification tasks. We observe a similar phenomenon on a two-layer ReLU MLP trained on MNIST (Section H.3.4). |
| Dataset Splits | Yes | Consider a binary mathematical operator ∘ on S = Z/pZ for some prime integer p. We want to predict y∗(x) = x1 ∘ x2 given x = (x1, x2) ∈ S². The dataset D = {(x, y∗(x)) \| x ∈ S²} is randomly partitioned into two disjoint and non-empty sets Dtrain and Dval, the training and the validation dataset respectively. ... We train this model on addition modulo p = 97 with rtrain := \|Dtrain\|/\|D\| = 40%. |
| Hardware Specification | No | The authors acknowledge the material support of NVIDIA in the form of computational resources. |
| Software Dependencies | No | For the experiments of this section only, we used Adam as the optimizer, with its default parameters (as specified in PyTorch), except for the learning rate. |
| Experiment Setup | Yes | Validation Experiments Using (n, s, N, α, β) = (10², 5, 30, 10⁻¹, 10⁻⁵), we observe a grokking-like pattern, where the training error ‖Xa(t) − y‖₂ first decreases to 10⁻⁶, then after a long training time, the recovery error ‖a(t) − a∗‖₂ decreases and matches the training error (Figure 3)... We set (n1, n2, r, N, α, β) = (10, 10, 2, 70, 10⁻¹, 10⁻⁴), and optimize the noiseless matrix completion problem using subgradient descent... For the experiments of this section only, we used Adam as the optimizer, with its default parameters (as specified in PyTorch), except for the learning rate. |
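The modular-addition dataset quoted under Dataset Splits can be sketched in a few lines. This is our illustrative reconstruction, not the paper's code: the prime p = 97 and the 40% train fraction come from the quoted excerpt, while the random seed and variable names are our own choices (the paper's seed is not quoted).

```python
import random

p = 97            # prime modulus, from the quoted setup
r_train = 0.40    # r_train := |D_train| / |D|, as reported

# D = {((x1, x2), (x1 + x2) mod p) : x1, x2 in Z/pZ}
data = [((x1, x2), (x1 + x2) % p) for x1 in range(p) for x2 in range(p)]

random.seed(0)    # assumption: any fixed seed; the paper's seed is not quoted
random.shuffle(data)

# Disjoint, non-empty train/validation partition of the full dataset
n_train = int(r_train * len(data))
train, val = data[:n_train], data[n_train:]

print(len(data), len(train), len(val))
```

With p = 97 this yields |D| = 9409 input pairs, of which 3763 land in the training split; the two splits are disjoint by construction since they are contiguous slices of one shuffled list.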
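The update rule quoted under Pseudocode, x(t+1) = x(t) − α(∇G(x(t)) + β∇H(x(t))), can be illustrated on a toy sparse linear teacher-student problem with G a squared loss and H an ℓ1 penalty (subgradient sign(a)). This is a minimal sketch under our own assumptions: the problem sizes, iteration count, and step sizes below are illustrative and are not the paper's (n, s, N, α, β) = (10², 5, 30, 10⁻¹, 10⁻⁵) configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse teacher-student setup (sizes are illustrative, not the paper's)
n, s, N = 50, 3, 200
a_star = np.zeros(n)                                   # s-sparse teacher vector
a_star[rng.choice(n, size=s, replace=False)] = rng.normal(size=s)
X = rng.normal(size=(N, n))
y = X @ a_star

alpha, beta = 1e-2, 1e-3    # learning rate and regularization strength (ours)

# Eq. (1): x(t+1) = x(t) - alpha * (grad G + beta * grad H), with
# G(a) = ||Xa - y||^2 / (2N) and H(a) = ||a||_1 (subgradient sign(a))
a = 0.1 * rng.normal(size=n)
for _ in range(5000):
    grad_G = X.T @ (X @ a - y) / N
    grad_H = np.sign(a)
    a -= alpha * (grad_G + beta * grad_H)

print(np.linalg.norm(X @ a - y))       # training error ||Xa(t) - y||_2
print(np.linalg.norm(a - a_star))      # recovery error ||a(t) - a*||_2
```

Both quantities shrink here because this overdetermined toy problem is easy; the grokking-like delay the paper reports, where the training error drops long before the recovery error follows, arises in the paper's specific regimes and is not reproduced by this sketch.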