Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers
Authors: Pulkit Gopalani, Wei Hu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Training Transformers on algorithmic tasks frequently demonstrates an intriguing abrupt learning phenomenon: an extended performance plateau followed by a sudden, sharp improvement. This work investigates the underlying mechanisms for such dynamics, primarily in shallow Transformers. We reveal that during the plateau, the model often develops an interpretable partial solution while simultaneously exhibiting a strong repetition bias in its outputs. This output degeneracy is accompanied by internal representation collapse, where hidden states across different tokens become nearly parallel. We further identify the slow learning of optimal attention maps as a key bottleneck. Hidden progress in attention configuration during the plateau precedes the eventual rapid convergence, and directly intervening on attention significantly alters plateau duration and the severity of repetition bias and representational collapse. We validate that these identified phenomena, repetition bias and representation collapse, are not artifacts of toy setups but also manifest in the early pre-training stage of large language models like Pythia and OLMo. |
| Researcher Affiliation | Academia | Pulkit Gopalani, University of Michigan, Ann Arbor; Wei Hu, University of Michigan, Ann Arbor |
| Pseudocode | No | The paper describes methods and procedures using prose and mathematical equations but does not include any explicitly labeled pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code for experiments is available at github.com/pulkitgopalani/tf-loss-plateau. |
| Open Datasets | Yes | We use Pythia [6] / OLMo-2 [37] pretrained models (Apache 2.0 Licence) hosted on Huggingface Transformers [50] and evaluate them on the ARC-Easy / Challenge datasets [12] (CC-BY-SA 4.0 Licence), and GSM8K [13] (MIT Licence). |
| Dataset Splits | Yes | The training is conducted in an online / single-epoch fashion, where a new batch of 256 training samples is drawn from the data distribution at each training step. Note that in this setup, the training and test losses essentially coincide. For LLM Experiments... Specifically, we randomly sample 100 questions from the test split of the AI2 ARC-Easy dataset [12]. |
| Hardware Specification | Yes | All experiments were conducted on a single GPU (NVIDIA A100 or L40S) on an academic computing cluster. |
| Software Dependencies | No | We use the existing minGPT implementation [26] (MIT licence) for our experiments, modifying the code as above and wherever required. We use Pythia [6] / OLMo-2 [37] pretrained models (Apache 2.0 Licence) hosted on Huggingface Transformers [50]... implemented using scikit-learn MLPClassifier [35]. The paper mentions software such as minGPT, Huggingface Transformers, and scikit-learn but does not provide specific version numbers for these components, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We use the Adam optimizer with a constant learning rate $10^{-4}$ and no weight decay. The training is conducted in an online / single-epoch fashion, where a new batch of 256 training samples is drawn from the data distribution at each training step. For generating sequences, we use greedy decoding i.e. output token is determined by the maximum logit over the vocabulary. We use learning rate 0.02 for Muon, and 1e-4 for Adam. We set the use_cache=False in the generate function, and use the hidden state used for predicting each of the 8 output tokens. For random sampling, we use do_sample=True (using default temperature value), using do_sample=False for our greedy decoding results. |
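
The greedy decoding rule quoted in the Experiment Setup row, and the two plateau symptoms named in the Research Type row (repetition bias and near-parallel hidden states), can be illustrated in plain Python. This is a minimal sketch and not the authors' code: the function names `greedy_token`, `repetition_fraction`, and `mean_pairwise_cosine` are hypothetical, and the specific metric definitions (adjacent-token repeats, mean pairwise cosine similarity) are assumptions that may differ from those used in the paper.

```python
import math


def greedy_token(logits):
    """Greedy decoding step: index of the maximum logit over the vocabulary."""
    return max(range(len(logits)), key=lambda i: logits[i])


def repetition_fraction(tokens):
    """Fraction of positions whose token equals the previous token.

    Assumed proxy for repetition bias; the paper's metric may differ.
    """
    if len(tokens) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return repeats / (len(tokens) - 1)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(hidden_states):
    """Average cosine similarity over all distinct pairs of hidden states.

    Values near 1 mean the vectors are nearly parallel, which is the
    signature of representation collapse described in the abstract.
    Requires at least two non-zero vectors.
    """
    n = len(hidden_states)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(cosine(hidden_states[i], hidden_states[j]) for i, j in pairs)
    return total / len(pairs)
```

Under these assumptions, a model stuck on the plateau would show a high `repetition_fraction` on its greedy outputs together with a `mean_pairwise_cosine` close to 1 across token positions, and both values would drop once the loss falls.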