Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Less is More: Local Intrinsic Dimensions of Contextual Language Models

Authors: Benjamin Matthias Ruppik, Julius von Rohrscheidt, Carel van Niekerk, Michael Heck, Renato Vukovic, Shutong Feng, Hsien-chin Lin, Nurul Lubis, Bastian Rieck, Marcus Zibrowius, Milica Gasic

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we introduce a novel perspective based on the geometric properties of contextual latent embeddings to study the effects of training and fine-tuning. To that end, we measure the local dimensions of a contextual language model's latent space and analyze their shifts during training and fine-tuning. We show that the local dimensions provide insights into the model's training dynamics and generalization ability. Specifically, the mean of the local dimensions predicts when the model's training capabilities are exhausted, as exemplified in a dialogue state tracking task, overfitting, as demonstrated in an emotion recognition task, and grokking, as illustrated with an arithmetic task. Furthermore, our experiments suggest a practical heuristic: reductions in the mean local dimension tend to accompany and predict subsequent performance gains.
Researcher Affiliation	Academia	1 Faculty of Mathematics and Natural Sciences, Heinrich Heine University Düsseldorf, Germany 2 Institute of AI for Health, Helmholtz Munich, Germany 3 Technical University of Munich, Germany 4 University of Fribourg, Switzerland
Pseudocode	Yes	Algorithm 1: COMPUTE LOCAL DIMENSION ESTIMATES
Open Source Code	Yes	Our code is available at https://github.com/aidos-lab/Topo_LLM_public and https://github.com/aidos-lab/grokking-via-lid.
Open Datasets	Yes	MultiWOZ2.1 (Eric et al., 2020): Human-human multi-domain dialogues; Schema-Guided Dialogue SGD (Rastogi et al., 2020): Human-virtual assistant dialogues; Reddit: Reddit comments from the year 2022 mentioning Tesla, Inc.; ICLR 2024 Submissions: Titles and abstracts of ICLR 2024 papers collected by us; and Wikipedia: The Hugging Face wikitext-103-v1 corpus. ... All datasets are freely available under permissive licenses; more details can be found in the experiments in Section 4 and Appendix B.
Dataset Splits	Yes	We split Reddit and Wikipedia into training (80%), validation (10%), and test (10%) subsets. ... Further dataset statistics are available in Table 1. ... Table 1: Dataset sizes for various datasets used in the experiments. Some datasets use pre-determined splits, while others involve random splits.
Hardware Specification	Yes	Fine-tuning of the RoBERTa-base and GPT-2-medium models (in Appendix C.1) on one of our selected datasets can be performed efficiently on a single NVIDIA V100 or GTX1080TI-12GB GPU within a few hours. Additionally, computing the embeddings for a single layer of these models requires only forward passes, which takes approximately 10 minutes on the same hardware. ... The computation is feasible in 20 minutes on an E5-2640v4 (Broadwell) 2.40GHz dual-core machine with 32GB of RAM using the scikit-dimension package (Bac et al., 2021) for a typical dataset with tens of thousands of points in high-dimensional space (ambient dimension in the hundreds).
Software Dependencies	No	The computation is feasible in 20 minutes on an E5-2640v4 (Broadwell) 2.40GHz dual-core machine with 32GB of RAM using the scikit-dimension package (Bac et al., 2021) for a typical dataset with tens of thousands of points in high-dimensional space (ambient dimension in the hundreds). ... Optimization is performed by AdamW (Loshchilov and Hutter, 2017) ... The Python package management system uv is used to reproduce the virtual environment with all dependencies, and run commands for the most important entry points into the codebase are provided.
Experiment Setup	Yes	Fine-tuning of the RoBERTa-base models (Liu et al., 2019) is performed using masked language modeling with a masking probability of 0.15. Each model is trained for 5 epochs on 10 000 training examples using a batch size of 8, a learning rate peaking at 5 × 10⁻⁵ with 500 warmup steps, and linear decay thereafter. Weight decay of 0.01 is applied throughout. ... Optimization is performed by AdamW (Loshchilov and Hutter, 2017) with learning rate 0.001 (with linear schedule over 400k steps, warmup of 10 steps), batch size 512, weight decay 0.01. ... We train the models for 20 epochs with Adam, with a linear learning rate schedule up to 5 × 10⁻⁵ that starts with one warm-up epoch. ... The model is trained for 8 epochs using a linear learning rate schedule with warmup for AdamW and peak learning rate of 2 × 10⁻⁵.
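
The first fine-tuning configuration quoted under Experiment Setup (5 epochs, 10 000 training examples, batch size 8, 500 warmup steps, then linear decay) can be illustrated with a short sketch. This is not the authors' code. The function names are hypothetical, and the sketch assumes one optimizer step per batch (no gradient accumulation, single device), a linear ramp from zero during warmup, and a decay that reaches zero at the final step. None of these details are stated in the quoted text.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer steps, assuming one step per batch (hypothetical helper)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


def linear_warmup_decay_lr(
    step: int,
    peak_lr: float = 5e-5,
    warmup_steps: int = 500,
    total_steps: int = 6250,
) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumed shape: linear ramp from 0 to peak_lr over warmup_steps,
    then linear decay from peak_lr to 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    steps = total_optimizer_steps(num_examples=10_000, batch_size=8, epochs=5)
    print("total steps:", steps)
    for s in (0, 250, 500, 3375, steps):
        print(s, linear_warmup_decay_lr(s, total_steps=steps))
```

Under these assumptions the run has 1 250 steps per epoch and 6 250 steps in total, so warmup covers the first 8% of training. The other configurations in the row (for example the 400k-step schedule with 10 warmup steps) would use the same function with different arguments.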