Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

What are you sinking? A geometric approach on attention sink

Authors: Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Methodology, 5 Analysis Results, Experimental design: We analyzed a diverse set of transformer models to investigate architecture-specific and architecture-invariant patterns in reference frame formation: decoder-only models LLaMA-3.2 (1B, 3B) and 3.1 (8B-Instruct, 8B), Phi-2, Qwen-2.5 (3B, 7B, 7B-Instruct), Mistral-7B-v0.1, Gemma-7B, Pythia (1.4B, 2.8B, 6.9B, 12B); and encoder-only models BERT-base-uncased, XLM-RoBERTa-large. For topological, spectral graph, value space and KL divergence analyses, we used a dataset of STEM-focused Wikipedia sentences (mathematics, chemistry, medicine, physics) ranging from 6 to 50 tokens. We processed 500 samples for topology, spectral and Fisher information analysis, and a subset of 50 samples for KL divergence analysis. For the temporal RMT analysis, we examined 100 samples across training checkpoints of Pythia models. All experiments were conducted using Google Colab with T4 or A100 GPUs.
Researcher Affiliation | Academia | Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri; Sapienza University of Rome; EMAIL, EMAIL
Pseudocode | No | The paper describes methodologies in prose, such as in Section 4 "Methodology" and its subsections, but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available in the supplementary materials.
Open Datasets | Yes | "For topological, spectral graph, value space and KL divergence analyses, we used a dataset of STEM-focused Wikipedia sentences (mathematics, chemistry, medicine, physics) ranging from 6 to 50 tokens." and "our STEM-focused Wikipedia dataset was derived from Wikipedia content under CC BY-SA 3.0 licensing."
Dataset Splits | No | The paper states, "We processed 500 samples for topology, spectral and Fisher information analysis, and a subset of 50 samples for KL divergence analysis. For the temporal RMT analysis, we examined 100 samples across training checkpoints of Pythia models." Sample counts are provided, but explicit training/validation/test splits are not, because the paper analyzes pre-trained models. The sample selection methodology and random seed needed for exact reproduction are not detailed.
Hardware Specification | Yes | All experiments were conducted using Google Colab with T4 or A100 GPUs.
Software Dependencies | No | The paper describes various analytical methods (e.g., topological analysis with persistent homology, spectral graph analysis, KL divergence) and mentions using the Ripser algorithm for persistent homology, but does not provide version numbers for any software libraries or dependencies used in the implementation.
Experiment Setup | Yes | We computed these matrices at multiple thresholds (0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2), measuring: i) Algebraic connectivity... ii) Star-likeness... iii) Gini coefficient... iv) Degree centralization... Our implementation identified attention sinks using percentile thresholds (0.8, 0.9, 0.95), measuring the changes in KL divergence when attention sinks were removed...
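The thresholded graph metrics quoted in the Experiment Setup row can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the quoted text elides the paper's exact definitions, so the code assumes textbook ones (algebraic connectivity as the second-smallest Laplacian eigenvalue, the Gini coefficient taken over node degrees, Freeman degree centralization) and assumes the attention matrix is symmetrized into an undirected graph. Star-likeness is omitted because its definition is not given here. The function names are hypothetical; only the threshold values come from the quoted text.

```python
import numpy as np

# Threshold values quoted in the Experiment Setup row.
THRESHOLDS = (0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2)


def threshold_graph(attn, tau):
    """Assumed construction: undirected edge i-j if attention exceeds tau in either direction."""
    a = np.asarray(attn, dtype=float)
    adj = ((a > tau) | (a.T > tau)).astype(float)
    np.fill_diagonal(adj, 0.0)  # ignore self-attention loops
    return adj


def algebraic_connectivity(adj):
    """Second-smallest eigenvalue of the unnormalized Laplacian L = D - A."""
    lap = np.diag(adj.sum(axis=1)) - adj
    eigenvalues = np.linalg.eigvalsh(lap)  # returned in ascending order
    return float(eigenvalues[1])


def gini(values):
    """Gini coefficient of a non-negative vector (0 = perfectly equal)."""
    x = np.sort(np.asarray(values, dtype=float))
    n, total = x.size, x.sum()
    if total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(((2 * ranks - n - 1) @ x) / (n * total))


def degree_centralization(adj):
    """Freeman degree centralization: 1 for a star graph, 0 for a regular graph."""
    deg = adj.sum(axis=1)
    n = deg.size
    if n < 3:
        return 0.0
    return float((deg.max() - deg).sum() / ((n - 1) * (n - 2)))


def graph_metrics(attn, thresholds=THRESHOLDS):
    """Compute the three metrics for one attention matrix at each threshold."""
    results = {}
    for tau in thresholds:
        adj = threshold_graph(attn, tau)
        results[tau] = {
            "algebraic_connectivity": algebraic_connectivity(adj),
            "gini_degree": gini(adj.sum(axis=1)),
            "degree_centralization": degree_centralization(adj),
        }
    return results


def toy_sink_attention(n=5, sink_weight=0.9):
    """Toy row-stochastic matrix where every token attends mostly to token 0."""
    attn = np.zeros((n, n))
    attn[:, 0] = sink_weight
    attn[np.arange(1, n), np.arange(1, n)] = 1.0 - sink_weight
    attn[0, 0] = 1.0
    return attn
```

Calling `graph_metrics(toy_sink_attention())` returns one dictionary of metrics per threshold. In this toy case the thresholds above 0.1 yield a star graph centered on token 0, the pattern that a high degree centralization would flag as sink-like.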