Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking

Authors: Ting Han, Linara Adilova, Henning Petzka, Jens Kleesiek, Michael Kamp

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We disentangle these questions using grokking, a training regime in which memorization precedes generalization, allowing us to temporally separate generalization from training dynamics, and we find that while both neural collapse and relative flatness emerge near the onset of generalization, only flatness consistently predicts it. Models encouraged to collapse or prevented from collapsing generalize equally well, whereas models regularized away from flat solutions exhibit delayed generalization, resembling grokking, even in architectures and datasets where it does not typically occur. Furthermore, we show theoretically that neural collapse leads to relative flatness under classical assumptions, explaining their empirical co-occurrence.
Researcher Affiliation Academia Ting Han (Lamarr Institute, TU Dortmund, Germany and Institute for AI in Medicine, UK Essen); Linara Adilova (Research Center Trustworthy Data Science and Security of the University Alliance Ruhr, TU Dortmund, Germany); Henning Petzka (Ruhr University Bochum, Germany); Jens Kleesiek (Institute for AI in Medicine, UK Essen and Department of Physics, TU Dortmund, Germany); Michael Kamp (Lamarr Institute, TU Dortmund, Germany and Institute for AI in Medicine, UK Essen)
Pseudocode No The paper describes theoretical propositions and their proofs, and experimental methodologies, but does not present any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code Yes Code implementation: https://github.com/TrustworthyMachineLearning-Lab/grokking flatness.
Open Datasets Yes We train a ResNet-18 on CIFAR-10... We train ResNet-10 on CIFAR-10 and ViT [Dosovitskiy et al., 2021] on ImageNet-100... https://www.kaggle.com/datasets/ambityga/imagenet100/data... fine-tune TinyBERT [Jiao et al., 2020] and DistilGPT2 [Sanh et al., 2019] on SST-5
Dataset Splits Yes Each experiment runs for 10^6 steps with a 50/50 train/validation split... We evaluate our method on the CIFAR-10 dataset using the standard training (50,000 samples) and test (10,000 samples) splits... We evaluate the relative-flatness regularizer on the ImageNet-100 dataset using the standard train/test split.
Hardware Specification Yes Experiments are conducted using PyTorch 2.4.1 on a single NVIDIA A100 GPU with 80GB of memory.
Software Dependencies Yes Experiments are conducted using PyTorch 2.4.1 on a single NVIDIA A100 GPU with 80GB of memory.
Experiment Setup Yes training a 2-layer transformer using the AdamW optimizer with a learning rate of 10^-4 and weight decay of 1.0... All models are trained using stochastic gradient descent (SGD) with a fixed learning rate of 0.01, a batch size of 64, and no weight decay. The regularization coefficient is set to λ = 10^-3. Momentum is 0.9. Training is performed for 250 epochs... the coefficient of the relative-flatness regularizer is set to 0.01. During regularized training, the weight capping value is fixed at 50 and the penultimate representation is normalized (ϕ = 1). We introduce temperature parameter τ = 2 in the softmax... The coefficient of the regularizer is set to λ = 10^-2. To induce delayed generalization, the regularizer is removed after epoch 150. We optimize with SGD using a fixed learning rate of 0.01, momentum of 0.9, no weight decay... The temperature τ is set to 2... and the weight capping is set to 150. Training runs for 300 epochs... The batch size is 256... The training hyperparameters are listed in Table 1.
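
The Experiment Setup entry mentions a temperature parameter τ = 2 in the softmax. As a rough illustration of what a temperature does, the sketch below divides the logits by τ before normalizing, which is the usual convention. The function name `softmax_with_temperature` and the example logits are hypothetical, and where exactly the paper applies the temperature (loss, regularizer, or both) is not stated in the excerpt above, so this should not be read as the authors' implementation.

```python
import math


def softmax_with_temperature(logits, tau=2.0):
    """Return softmax(logits / tau); tau > 1 flattens the distribution.

    Illustrative helper only; not taken from the paper's code.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a 3-class problem.
logits = [2.0, 1.0, 0.0]
p_plain = softmax_with_temperature(logits, tau=1.0)  # standard softmax
p_soft = softmax_with_temperature(logits, tau=2.0)   # tau = 2, as quoted above
```

With τ = 2 the largest probability is smaller than under the standard softmax (τ = 1), while the ranking of the classes is unchanged.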