Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Geometry of Decision Making in Language Models

Authors: Abhinav Joshi, Divyanshu Bhatt, Ashutosh Modi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform a large-scale study, with 28 open-weight transformer models and estimate ID across layers using multiple estimators, while also quantifying per-layer performance on MCQA tasks. Our findings reveal a consistent ID pattern across models: early layers operate on low-dimensional manifolds, middle layers expand this space, and later layers compress it again, converging to decision-relevant representations.
Researcher Affiliation	Collaboration	Abhinav Joshi, Divyanshu Bhatt, Ashutosh Modi; Indian Institute of Technology Kanpur (IIT Kanpur); Indian Institute of Technology Hyderabad (IIT Hyderabad); Samsung Research & Development Institute, Bangalore
Pseudocode	No	The paper describes mathematical formulations for intrinsic dimension estimators (MLE, TwoNN, GRIDE) in Section 3 and Appendix A, but it does not include a dedicated 'Pseudocode' or 'Algorithm' block, nor does it present structured, code-like steps for any procedure.
Open Source Code	Yes	In a nutshell, our study covers both real-world benchmarks and template-based tasks, enabling us to characterize representational dynamics across a diverse range of reasoning and language understanding skills. We release the codebase and results at https://github.com/Exploration-Lab/dim-discovery-archive.
Open Datasets	Yes	For linguistic abilities, we consider the widely used CoLA dataset Warstadt et al. [2018] that contains English sentences from 23 linguistics publications... For world topic knowledge, we make use of AG News dataset Zhang et al. [2016]... MMLU Hendrycks et al. [2021] is another widely used benchmark... For this ability, we consider two known datasets, Rotten Tomatoes Pang and Lee [2005] and SST2 Socher et al. [2013]... We use COPA Gordon et al. [2012], and a small sample from the recently introduced COLD dataset Joshi et al. [2024].
Dataset Splits	No	The paper uses several standard datasets (e.g., CoLA, AG News, MMLU, Rotten Tomatoes, SST2, COPA, COLD) and extracts representations from them for analysis. However, it does not explicitly provide details about the training, validation, or test splits used to evaluate the pre-trained models. While these datasets often have predefined splits, the paper does not mention using them or define any splitting methodology for its experiments.
Hardware Specification	Yes	We perform all the experiments using a machine with 1 NVIDIA A40 GPU.
Software Dependencies	No	We make use of Transformer-Lens [Nanda and Bloom, 2022] for saving the corresponding representations. The paper does not specify version numbers for other key software components like Python or deep learning frameworks.
Experiment Setup	Yes	Table 1: ID estimator hyperparameters (Method: Parameter, Values). MLE: k, all values in range [12, 24]. MLE-Modified: k, all values in range [12, 24]. TwoNN: Discard Ratio, 0.1. GRIDE: n1, n2 = 20, 40.
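
For readers unfamiliar with the estimators named in the Experiment Setup row, the following is a minimal NumPy sketch of two of them: the TwoNN estimator and the Levina-Bickel maximum-likelihood (MLE) estimator. It is not the authors' released code. The function names are hypothetical, the distances are computed by brute force (suitable only for small samples), and the input is assumed to contain no duplicate points. Two interpretations are assumptions rather than facts from the paper: that the discard ratio of 0.1 means dropping the largest 10% of neighbor-distance ratios before the fit, and that "all values in range [12, 24]" means averaging the MLE estimate over those k.

```python
import numpy as np


def sorted_neighbor_distances(points):
    """Return, for each point, its distances to all other points in ascending order."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists.sort(axis=1)
    return dists[:, 1:]  # drop column 0, the zero distance of each point to itself


def twonn_id(points, discard_ratio=0.1):
    """TwoNN estimate: fit -log(1 - F(mu)) = d * log(mu) through the origin."""
    nn = sorted_neighbor_distances(points)
    mu = np.sort(nn[:, 1] / nn[:, 0])  # ratio of 2nd to 1st neighbor distance
    n = len(mu)
    # Assumption: the discard ratio removes the largest ratios before fitting.
    keep = min(int(np.floor(n * (1.0 - discard_ratio))), n - 1)
    x = np.log(mu[:keep])
    empirical_cdf = np.arange(1, keep + 1) / n
    y = -np.log(1.0 - empirical_cdf)
    return float(x @ y / (x @ x))  # least-squares slope with zero intercept


def mle_id(points, k):
    """Levina-Bickel MLE estimate for a single neighborhood size k."""
    nn = sorted_neighbor_distances(points)
    # log(T_k / T_j) for j = 1 .. k-1, where T_j is the j-th neighbor distance
    log_ratios = np.log(nn[:, k - 1][:, None] / nn[:, : k - 1])
    per_point = (k - 1) / log_ratios.sum(axis=1)
    return float(per_point.mean())


def mle_id_over_range(points, k_min=12, k_max=24):
    """Assumption: average the MLE estimate over every k in [k_min, k_max]."""
    return float(np.mean([mle_id(points, k) for k in range(k_min, k_max + 1)]))
```

Applied to a matrix of layer representations with shape (number of samples, hidden size), each function returns a single scalar estimate. On points sampled from a 3-dimensional subspace embedded in a higher-dimensional space, both estimates should land near 3, which is a convenient sanity check before running on real activations.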