Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Grokking Beyond the Euclidean Norm of Model Parameters
Authors: Tikeng Notsawo Pascal Junior, Guillaume Dumas, Guillaume Rabusseau
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Building upon this theoretical foundation, we validate its implications both theoretically and empirically across various settings: sparsity (Theorem 3.1) and low-rankness (Theorem 3.4). For sparsity, we focus on a linear teacher-student setup and show that recovery of sparse vectors using gradient descent and Lasso exhibits a grokking phenomenon, which is impossible using only ℓ2 regularization, regardless of the initialization scale, as advocated by previous art (Lyu et al., 2023; Liu et al., 2023b). Moreover, we empirically show that in deep linear networks, the sparse/low-rank structure of the data is enough to have generalization without explicit regularization. Adding depth makes it possible to grok or ungrok simply from the implicit regularization of gradient descent. We demonstrate this on a nonlinear teacher-student setup, on the algorithmic data setup where grokking was first observed (Power et al., 2022), and on image classification tasks. |
| Researcher Affiliation | Academia | 1Université de Montréal, Montréal, Québec, Canada 2Mila, Québec AI Institute, Montréal, Québec, Canada 3CHU Sainte-Justine Research Center, Montréal, Québec, Canada 4CIFAR AI Chair. |
| Pseudocode | No | The paper describes mathematical update rules and theoretical proofs, such as 'x(t+1) = x(t) − α(∇G(x(t)) + β∇H(x(t))), t ≥ 0' (Eq. 1), but does not contain clearly labeled pseudocode or algorithm blocks. The methods are described through equations and theoretical formulations. |
| Open Source Code | Yes | Our contributions can be summarized as follows1: (i) We show that grokking can be induced by the interplay between the sparse/low-rank structure of the solution and the ℓ1/ℓ∗ regularization used during training, extending previous results on ℓ2 regularization (Lyu et al., 2023). Our theoretical results extend beyond these specific regularizations, as we characterize the relationship between grokking time, regularization strength, and learning rate in a general setting. 1Code to reproduce our experiments: https://github.com/Tikquuss/grokking_beyong_l2_norm. |
| Open Datasets | Yes | We demonstrate this on a nonlinear teacher-student setup, on the algorithmic data setup where grokking was first observed (Power et al., 2022), and on image classification tasks. We observe a similar phenomenon on a two-layer ReLU MLP trained on MNIST (Section H.3.4). |
| Dataset Splits | Yes | Consider a binary mathematical operator ∘ on S = Z/pZ for some prime integer p. We want to predict y∗(x) = x1 ∘ x2 given x = (x1, x2) ∈ S². The dataset D = {(x, y∗(x)) \| x ∈ S²} is randomly partitioned into two disjoint and non-empty sets Dtrain and Dval, the training and the validation dataset respectively. ... We train this model on addition modulo p = 97 with rtrain := \|Dtrain\|/\|D\| = 40%. |
| Hardware Specification | No | The authors acknowledge the material support of NVIDIA in the form of computational resources. |
| Software Dependencies | No | For the experiments of this section only, we used Adam as the optimizer, with its default parameters (as specified in PyTorch), except for the learning rate. |
| Experiment Setup | Yes | Validation Experiments Using (n, s, N, α, β) = (10², 5, 30, 10⁻¹, 10⁻⁵), we observe a grokking-like pattern, where the training error ‖Xa(t) − y‖₂ first decreases to 10⁻⁶, then after a long training time, the recovery error ‖a(t) − a∗‖₂ decreases and matches the training error (Figure 3)... We set (n1, n2, r, N, α, β) = (10, 10, 2, 70, 10⁻¹, 10⁻⁴), and optimize the noiseless matrix completion problem using subgradient descent... For the experiments of this section only, we used Adam as the optimizer, with its default parameters (as specified in PyTorch), except for the learning rate. |
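The modular-addition dataset quoted under Dataset Splits can be sketched in a few lines. This is our illustrative reconstruction, not the paper's code: the prime p = 97 and the 40% train fraction come from the quoted excerpt, while the random seed and variable names are our own choices (the paper's seed is not quoted).

```python
import random

p = 97            # prime modulus, from the quoted setup
r_train = 0.40    # r_train := |D_train| / |D|, as reported

# D = {((x1, x2), (x1 + x2) mod p) : x1, x2 in Z/pZ}
data = [((x1, x2), (x1 + x2) % p) for x1 in range(p) for x2 in range(p)]

random.seed(0)    # assumption: any fixed seed; the paper's seed is not quoted
random.shuffle(data)

# Disjoint, non-empty train/validation partition of the full dataset
n_train = int(r_train * len(data))
train, val = data[:n_train], data[n_train:]

print(len(data), len(train), len(val))
```

With p = 97 this yields |D| = 9409 input pairs, of which 3763 land in the training split; the two splits are disjoint by construction since they are contiguous slices of one shuffled list.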
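The update rule quoted under Pseudocode, x(t+1) = x(t) − α(∇G(x(t)) + β∇H(x(t))), can be illustrated on a toy sparse linear teacher-student problem with G a squared loss and H an ℓ1 penalty (subgradient sign(a)). This is a minimal sketch under our own assumptions: the problem sizes, iteration count, and step sizes below are illustrative and are not the paper's (n, s, N, α, β) = (10², 5, 30, 10⁻¹, 10⁻⁵) configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse teacher-student setup (sizes are illustrative, not the paper's)
n, s, N = 50, 3, 200
a_star = np.zeros(n)                                   # s-sparse teacher vector
a_star[rng.choice(n, size=s, replace=False)] = rng.normal(size=s)
X = rng.normal(size=(N, n))
y = X @ a_star

alpha, beta = 1e-2, 1e-3    # learning rate and regularization strength (ours)

# Eq. (1): x(t+1) = x(t) - alpha * (grad G + beta * grad H), with
# G(a) = ||Xa - y||^2 / (2N) and H(a) = ||a||_1 (subgradient sign(a))
a = 0.1 * rng.normal(size=n)
for _ in range(5000):
    grad_G = X.T @ (X @ a - y) / N
    grad_H = np.sign(a)
    a -= alpha * (grad_G + beta * grad_H)

print(np.linalg.norm(X @ a - y))       # training error ||Xa(t) - y||_2
print(np.linalg.norm(a - a_star))      # recovery error ||a(t) - a*||_2
```

Both quantities shrink here because this overdetermined toy problem is easy; the grokking-like delay the paper reports, where the training error drops long before the recovery error follows, arises in the paper's specific regimes and is not reproduced by this sketch.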