Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Emergence of a High-Dimensional Abstraction Phase in Language Transformers

Authors: Emily Cheng, Diego Doimo, Corentin Kervadec, Iuri Macocco, Lei Yu, Alessandro Laio, Marco Baroni

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We take a high-level geometric approach to its analysis, observing, across five pre-trained transformer-based LMs and three input datasets, a distinct phase characterized by high intrinsic dimensionality. During this phase, representations (1) correspond to the first full linguistic abstraction of the input; (2) are the first to viably transfer to downstream tasks; (3) predict each other across different LMs. Moreover, we find that an earlier onset of the phase strongly predicts better language modelling performance. In short, our results suggest that a central high-dimensionality phase underlies core linguistic processing in many common LM architectures."
Researcher Affiliation | Academia | Universitat Pompeu Fabra, Area Science Park, University of Toronto, SISSA, ICREA
Pseudocode | No | The paper describes methods like GRIDE and Information Imbalance through textual explanations and equations (e.g., Equation 1), but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/chengemily1/id-llm-abstraction
Open Datasets | Yes | BookCorpus (Zhu et al., 2015); the Pile (Gao et al., 2020; specifically, the 10k-document subsample available on Hugging Face); and WikiText-103 (Merity et al., 2017).
Dataset Splits | Yes | "From each corpus, we sample, without replacement, a total of 50k distinct 20-token sequences (sequence length is counted according to the number of typographic tokens in the text of each corpus). We then divide these samples into partitions of 10k sequences each, which we use for the experiments. ... To train the linear probes, we first divide the data at random into train (80%) and validation (20%) sets."
Hardware Specification | Yes | "All experiments were run on a cluster with 12 nodes with 5 NVIDIA A30 GPUs and 48 CPUs each."
Software Dependencies | No | The paper mentions software like DadaPy, scikit-learn, and PyTorch (Appendix B), but it does not specify version numbers for these dependencies.
Experiment Setup | Yes | "We fixed the following hyperparameters of the MLP, attempting to approximate those used in the original paper (as each task takes days to complete, we could not perform our own hyperparameter search): number of layers: 1; layer dimensionality: 200; non-linearity: logistic; L2 regularization coefficient: 0.0001; seeds: 1, 2, 3, 4, 5. ... To train the linear probes, we take a sample of size N = 25000 corresponding to the train split on Hugging Face. We repeat the experiment with 5 distinct seeds. ... Number of epochs: 1000; seeds: 32, 36, 42, 46, 52."
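The Pseudocode row notes that the paper relies on GRIDE for intrinsic-dimension (ID) estimation. As a rough illustration of what such nearest-neighbor ID estimators compute, here is a minimal sketch of the simpler TwoNN estimator (Facco et al., 2017), which GRIDE generalizes to higher-order neighbor ratios. The paper itself uses the DadaPy implementation; this function is an illustrative stand-in, not the authors' code.

```python
import numpy as np

def twonn_id(X):
    """Estimate intrinsic dimension from the ratio of each point's
    second- to first-nearest-neighbor distance (TwoNN). GRIDE, used
    in the paper via DadaPy, generalizes these distance ratios."""
    X = np.asarray(X, dtype=float)
    # Full pairwise distance matrix (fine for small sample sizes).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)  # exclude self-distances
    dist.sort(axis=1)
    mu = dist[:, 1] / dist[:, 0]  # ratio of 2nd to 1st NN distance
    # Maximum-likelihood estimate: the ratios mu follow a Pareto(d) law.
    return len(X) / np.log(mu).sum()
```

For data lying on a d-dimensional manifold inside a higher-dimensional embedding space, the estimate tracks d rather than the ambient dimension, which is the property the paper's layer-wise ID profiles rely on.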
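The sampling and splitting protocol quoted in the Dataset Splits row (50k distinct 20-token sequences drawn without replacement, 10k-sequence partitions, and an 80/20 probe split) can be sketched as follows. The corpus handling and the notion of "token" here are simplifying assumptions, not the paper's exact pipeline; in particular, sampling distinct start offsets is only a proxy for sampling distinct sequences.

```python
import random

def sample_windows(tokens, n_samples, seq_len=20, seed=0):
    """Draw start offsets without replacement and cut fixed-length
    windows from a token stream."""
    rng = random.Random(seed)
    starts = rng.sample(range(len(tokens) - seq_len + 1), n_samples)
    return [tuple(tokens[s:s + seq_len]) for s in starts]

def partition(seqs, size=10_000):
    """Split the sample into fixed-size partitions (10k each in the paper)."""
    return [seqs[i:i + size] for i in range(0, len(seqs), size)]

def train_val_split(seqs, train_frac=0.8, seed=0):
    """Random 80/20 train/validation split, as used for the probes."""
    seqs = list(seqs)
    random.Random(seed).shuffle(seqs)
    cut = int(train_frac * len(seqs))
    return seqs[:cut], seqs[cut:]
```

In the paper's setting this would be run once per corpus (Bookcorpus, the Pile subsample, WikiText-103) with n_samples=50_000.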
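For the Experiment Setup row, the linear-probe training loop can be sketched in pure NumPy: an 80/20 split, 1000 epochs, and one run per seed. The optimizer (plain gradient descent) and learning rate are assumptions on my part; only the split ratio, epoch count, and seed values come from the report.

```python
import numpy as np

def train_probe_accuracy(X, y, lr=0.1, epochs=1000, seed=32):
    """Fit a logistic-regression probe by gradient descent on an 80/20
    train/validation split and return validation accuracy. The paper
    repeats such runs over seeds (e.g., 32, 36, 42, 46, 52) and
    reports aggregates."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    tr, va = idx[:cut], idx[cut:]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X[tr] @ w + b)))  # sigmoid outputs
        g = p - y[tr]                               # logistic-loss gradient
        w -= lr * X[tr].T @ g / len(tr)
        b -= lr * g.mean()
    pred = (X[va] @ w + b) > 0
    return float((pred == y[va].astype(bool)).mean())
```

Averaging this accuracy over the listed seeds mirrors the multi-seed protocol described in the quote; the layer whose representations first yield viable probe accuracy is what the paper uses to locate the high-dimensional phase.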